Abstract


Confidence-weighted online learning is a generalization of margin-based learning of linear classi- 
fiers in which the margin constraint is replaced by a probabilistic constraint based on a distribution 
over classifier weights that is updated online as examples are observed. The distribution captures a 
notion of confidence on classifier weights, and in some cases it can also be interpreted as replacing 
a single learning rate by adaptive per-weight rates. Confidence-weighted learning was motivated 
by the statistical properties of natural-language classification tasks, where most of the informa- 
tive features are relatively rare. We investigate several versions of confidence-weighted learning 
that use a Gaussian distribution over weight vectors, updated at each observed example to achieve 
high probability of correct classification for the example. Empirical evaluation on a range of text-categorization
tasks shows that our algorithms improve over other state-of-the-art online and batch
methods, learn faster in the online setting, and lead to better classifier combination for a type of 
distributed training commonly used in cloud computing. 


Keywords: online learning, confidence prediction, text categorization 


1. Introduction 


While online learning is among the oldest approaches to machine learning, starting with the perceptron
algorithm (Rosenblatt, 1958), it is still one of the most popular and successful for many
practical tasks. In online learning, algorithms operate in rounds, whereby the algorithm is shown a 
single example for which it must first make a prediction and then update its hypothesis once it has 
seen the correct label. While predictions traditionally take the form of either positive or negative 
labels (binary classification), algorithms have been extended to a variety of multi-class, regression, 
ranking and structured prediction problems. By operating one example at a time, online methods 
are fast, simple, make few assumptions about the data, and perform fairly well across many domains 
and tasks. For those reasons, online methods are often favored for large data problems, and they are 
also a natural fit for systems that learn from interaction with a user or another system. In addition to 
their nice empirical properties, online algorithms have been analyzed in the mistake bound model
(Littlestone, 1989), which supports both theoretical and empirical comparisons of performance. 
Cesa-Bianchi and Lugosi (2006) provide an in-depth analysis of online learning algorithms. 


Much of the machine learning in natural-language processing (NLP) is based on linear clas- 
sifiers over very high dimension sparse representations of the input trained on large training sets. 
These properties make online learning a natural choice. Extensions of online learning to structured 
problems (Collins, 2002; McDonald et al., 2004) achieved some of the best results in structured 
tasks such as part-of-speech tagging (Collins, 2002; Shen et al., 2007), text segmentation (McDon- 
ald et al., 2005a), noun-phrase chunking (Collins, 2002), parsing (McDonald et al., 2005b; Carreras 
et al., 2008), and machine translation (Chiang et al., 2008). Popular online methods for those tasks 
include the perceptron (Rosenblatt, 1958), passive-aggressive (Crammer et al., 2006a) and expo- 
nentiated gradient (Globerson et al., 2007). 


Online learning algorithms are typically used as black boxes in NLP, without consideration of
the peculiarities of natural language. Feature representations of text for tasks from spam filtering 
to parsing need to capture the variety of words, word combinations, and word attributes in the text, 
yielding very high-dimensional feature vectors, even though most of the features are absent in most 
texts. Nevertheless, those many rare features are very informative about the examples that contain 
them; indeed, features that occur frequently are typically less informative, hence the common use of 
stop-lists of frequent words such as function words, and of tf-idf term weighting.¹ In Figure 1, we
show the most predictive features for a simple NLP classification task and their frequency in data. 
Notice that while some predictive features are very common, most are relatively rare, indicating that 
modeling even infrequent features may be useful for learning. Therefore, it is worth investigating 
whether learning algorithms for linear classifiers could be improved to take advantage of these 
particularities of natural language data. 


The foregoing motivation led us to propose confidence-weighted (CW) learning, a class of online 
learning methods that maintain a probabilistic measure of confidence in each weight. Less confident 
weights are updated more aggressively than more confident ones. Weight confidence is formalized 
with a Gaussian distribution over weight vectors, which is updated for each new training example 
so that the probability of correct classification for that example under the updated distribution meets 
a specified confidence. The result is an algorithm with superior classification accuracy over state- 
of-the-art online and batch baselines, faster learning, and new classifier combination methods for 
parallel training. 

While our motivation for CW learning is from observations about NLP problems, the approach 
makes no assumptions about the input space and can be applied to other machine learning problems 
(Ma et al., 2009). 

This paper brings together two types of confidence-weighted algorithms originally introduced 
by Dredze et al. (2008) and Crammer et al. (2008). In addition to a unified presentation, we include 
alternative formulations of the diagonal covariance algorithms along with empirical results. We also 
include further empirical evidence of the strength of these methods and an analysis of algorithmic 
behavior on NLP problems. 





1. We note that data sparsity is different from model sparsity. Sparsifying regularizers, such as those that constrain
the $L_1$ norm of weight vectors (Andrew and Gao, 2007; Gao et al., 2007), are often proposed to remove redundant
features in very high-dimensional data, but they are complementary to the methods we present here to learn better in 
the presence of many rare but relevant features. 
Figure 1: The top quartile of negative (left) and positive (right) features as ranked by mutual information
with the label for sentiment data (described in Section 7). The x-axis is their (log)
rank by mutual information and the y-axis is their total (log) count in the data. While 
some very frequent features are useful for predicting the label (high on the curve) there 
are a large number of low frequency features (low on the curve) that are still useful for 
learning. A sparse model would likely remove these low frequency features despite their 
predictive value. 


We begin with a discussion of the motivating particularities of natural language data. We then 
introduce the confidence-weighted framework. From this framework we derive two types of al- 
gorithm following different formulations of the main constraint, each with a full covariance and 
several diagonalized versions. A series of experiments shows CW learning’s empirical benefits and 
an analysis reveals how algorithmic properties manifest themselves empirically. We conclude with 
a discussion of related work. 


2. Characteristics of NLP Data 


Extensive experience with building classifiers for a wide range of language processing tasks shows 
that correct classification requires many specific features, including the presence at specified po- 
sitions of particular words, affixes, or word combinations (such as bigrams) in the example to be 
classified. An individual example has a very small fraction of those features, but collectively, ex- 
amples to be classified may involve a very large number of features (10° — 10°), most of which 
only occur in a few examples. The vector representation of the typical example is a very sparse high 
dimensional vector where only a small fraction of elements is nonzero, and feature frequencies have
a heavy-tailed distribution (Figure 1). 


Online algorithms do well with large numbers of features and examples, but they are not de- 
signed specifically for very sparse examples with a heavy-tailed feature frequency distribution. This 
can have a detrimental effect on learning. Typical linear classifier training algorithms update the 
weights of binary features only when they occur. The result is many updates for frequent features 
and few updates for rare features. Similarly, features that occur early in the data stream take more 
responsibility for correct prediction than those observed later. The result is a model that could have 
good weight estimates for common features but inaccurate weights for the great majority of features, 
which occur relatively rarely. 


An illustrative case arises in sentiment classification. In this task, a product review is represented 
as n-grams and the goal is to label the review as being positive or negative about the product. 
Consider a positive review that simply read “I liked this author.” An online update would increase
the weight of both “liked” and “author.” Since both are common words, over several examples the 
algorithm would converge to the correct values, a positive weight for “liked” and zero weight for 
“author.” Now consider a slightly modified negative example: “I liked this author, but found the 
book dull.” Since “dull” is a rare feature, the algorithm has a poor estimate of its weight. An update 
would decrease the weight of both “liked” and “dull.” The algorithm does not know that “dull” is
rare and that the changed behavior is likely caused by the poorly estimated rare feature (“dull”) instead
of the well estimated common feature (“liked”). An algorithm that maintains no information about
the relative frequency of features, or other second-order information about them, would attribute equal negative
weight to both “liked” and “dull”, which slows convergence.


This example demonstrates how a lack of memory for previous examples—a property that al- 
lows online learning—can hurt learning. A simple solution is to augment an online algorithm with 
additional information, a memory of past examples. Specifically, the algorithm can maintain a con- 
fidence value for each feature weight. For example, assuming binary features, the algorithm could 
keep a count of the number of times each feature has been observed or how many times each weight 
has been updated. The larger the count, the more confidence we have in the weight of that feature. 
These estimates are then used to influence weight updates. Instead of equally updating every feature 
weight for the on-features of an example, the update favors changing low-confidence weights more 
aggressively than high-confidence ones. At each update, the confidence in the weights of observed 
features is increased, which will focus the update on the low confidence weights. In the example 
above, the update would decrease the weight of “dull” but make only a small change to “liked” since 
the algorithm already has a good estimate of this weight. 
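
The count-based heuristic described above can be sketched in a few lines. This is only an illustration of the informal argument, not the CW algorithm developed later; the function name `count_scaled_update`, the base step `eta0`, and the `1 / (1 + count)` scaling are assumptions introduced here for the example.

```python
from collections import defaultdict

def count_scaled_update(weights, counts, active_features, y, eta0=1.0):
    """Perceptron-style update for binary features in which each weight
    moves by eta0 / (1 + count): rarely seen features take larger steps.
    (Hypothetical helper; the scaling rule is an assumption.)"""
    for f in active_features:
        weights[f] += y * eta0 / (1.0 + counts[f])
        counts[f] += 1  # observing the feature again raises our confidence
    return weights, counts

weights, counts = defaultdict(float), defaultdict(int)
weights["liked"] = 1.0
counts["liked"] = 99   # "liked" has been seen often, "dull" never

# Negative review containing both words: "dull" absorbs most of the change.
count_scaled_update(weights, counts, ["liked", "dull"], y=-1)
```

In this toy run the weight of “dull” takes a full step while the weight of “liked” moves by only one hundredth of a step, which is the qualitative behavior the paragraph argues for.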


In the next section, we use this motivation from language data to present a new family of learning 
algorithms that associate a confidence value with each weight. For now, we wish to dispel two 
potential misinterpretations of the preceding very informal argument. First, while our approach is 
motivated by learning with sparse binary features with a heavy-tailed frequency distribution, the 
algorithms do not depend on those assumptions. Second, our notion of weight confidence is based 
on a probabilistic interpretation of passive-aggressive online learning, which differs from the more 
familiar Bayesian learning for linear classifiers. Nevertheless, analogously to Bayesian learning, 
it can be used to provide a useful notion of prediction confidence through a margin distribution 
(Dredze and Crammer, 2008a,b; Dredze et al., 2010). 


A summary of the notation used throughout this paper appears in Table 1. 
Symbol          Meaning
$x_i$           Example on round $i$
$\hat{y}_i$     Prediction on round $i$
$y_i$           Label on round $i$
$w_i$           Weight vector on round $i$
$\mu_i$         The mean of the distribution on round $i$
$\Sigma_i$      The covariance matrix of the distribution on round $i$
$m_i$           Margin on round $i$
$v_i$           Margin variance on round $i$
$\eta$          Confidence level
$\phi$          The free parameter for CW, defined as $\phi = \Phi^{-1}(\eta)$

Table 1: A reference table for notation used throughout the paper.


3. Online Learning of Linear Classifiers 


Online algorithms operate in rounds, where each round corresponds to a single example. On round 
$i$ the algorithm receives an example $x_i \in \mathbb{R}^d$ to which it applies its current prediction rule to produce
a prediction $\hat{y}_i \in \{-1,+1\}$ (for binary classification). It then receives the true label $y_i \in \{-1,+1\}$
and suffers a loss $\ell(y_i,\hat{y}_i)$, which in this work will be the zero-one loss: $\ell(y_i,\hat{y}_i) = 1$ if $y_i \neq \hat{y}_i$ and
$\ell(y_i,\hat{y}_i) = 0$ otherwise. The algorithm then updates its prediction rule and proceeds to the next
round. For online evaluations, error is reported as the total loss $\ell$ on the training data, and in batch
evaluations, error is reported on held-out data.
As is common in linear classification, our prediction rules are linear threshold functions 


$$ f_w(x) : f_w(x) = \operatorname{sign}(x \cdot w) . $$


Two functions $f_w$ and $f_{cw}$ are the same for any positive $c$. Thus, we can identify $f_w$ with $w$, which
we will do in what follows.

The signed margin of an example $(x,y)$ with respect to a specific classifier $w$ is defined to be
$y(w \cdot x)$. The sign of the margin is positive iff the classifier $w$ correctly predicts the true label $y$.
The absolute value of the margin $|y(w \cdot x)| = |w \cdot x|$ can be thought of as the confidence² in the
prediction, with larger positive values corresponding to more confident correct predictions. We
denote the margin at round $i$ by $m_i = y_i(w_i \cdot x_i)$.

A variety of linear classifier training algorithms, including the perceptron and linear support 
vector machines, restrict w to be a linear combination of the input examples. Online algorithms of 
that kind typically have updates of the form 


$$ w_{i+1} = w_i + \alpha_i y_i x_i , \tag{1} $$


for some non-negative coefficients $\alpha_i$.

2. Note that we use the term “confidence” here as is commonly used in the literature to refer to the size of the margin.
This should not be confused with the idea of weight confidence used in this work. In fact, while margin size is
often taken as prediction confidence, such as in active learning (Tong and Koller, 2001), this interpretation is open to
debate.

In this paper we focus on passive-aggressive (PA) updates (Crammer et al., 2006a) for linear
classifiers. After predicting with $w_i$ on the $i$th round and receiving the true label $y_i$, the algorithm
updates the prediction function such that the example $(x_i,y_i)$ will be classified correctly with a fixed
margin (which can always be scaled to 1):

$$ w_{i+1} = \arg\min_{w} \frac{1}{2}\|w_i - w\|^2 \quad \text{s.t.} \quad y_i(w \cdot x_i) \ge 1 . \tag{2} $$

The general form of this problem is to enforce some learning constraints, in this case a prediction 
margin on the example, while minimizing the divergence to the current weights, which are assumed 
to be good since they encapsulate all previously observed examples. Solving this problem leads to 

an update of the form given by (1) with coefficient $\alpha_i$ defined on each round as:

$$ \alpha_i = \frac{\max\{1 - y_i(w_i \cdot x_i),\; 0\}}{\|x_i\|^2} . \tag{3} $$



Like the perceptron, this is a mistake-driven update, whereby $\alpha_i > 0$ iff the learning condition was
not met, i.e., the example was not classified with a margin of at least 1. Note that the numerator of
(3) is the hinge loss, which is zero only if the example is classified with a margin of at least 1. In practice,
slack variables are introduced for non-separable data, which caps the coefficient of (3) as $\min\{\alpha_i, C\}$, for some free
parameter $C$.

Crammer et al. (2006a) provide a theoretical analysis of algorithms of this form, which have 
been shown to work well in a variety of applications (McDonald et al., 2004, 2005a; Chiang et al., 
2008). 
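
As a concrete reading of Equations (1) to (3), the following minimal sketch performs one PA step on dense Python lists. The function name `pa_update` is an assumption made for this example, and the optional cap `C` follows the PA-I variant of Crammer et al. (2006a), which takes the minimum of the coefficient and `C`.

```python
def pa_update(w, x, y, C=None):
    """One passive-aggressive step on dense lists w and x (sketch of (1)-(3))."""
    margin = y * sum(wj * xj for wj, xj in zip(w, x))
    sq_norm = sum(xj * xj for xj in x)
    if sq_norm == 0.0:
        return list(w)                         # an all-zero example carries no information
    alpha = max(1.0 - margin, 0.0) / sq_norm   # hinge loss divided by ||x||^2, as in (3)
    if C is not None:
        alpha = min(alpha, C)                  # cap the step size for non-separable data
    return [wj + alpha * y * xj for wj, xj in zip(w, x)]  # update of the form (1)

x, y = [0.5, 1.0], +1
w = pa_update([0.0, 0.0], x, y)
margin_after = y * sum(wj * xj for wj, xj in zip(w, x))
```

Without the cap, the updated weights classify the example with a margin of exactly 1 (up to floating-point error), which is the equality case of the constraint in (2).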


4. Distributions over Classifiers 


Following the motivation of Section 2, we need a notion of confidence for the weight vector w 
maintained by an online learner for linear classifiers. Before any examples are seen, all of the 
weights in w are equally uncertain. As examples are observed, the confidence in the weights of 
features that are often active should increase faster than the confidence in the weights of rarely seen 
features. 

Our concrete implementation of this idea is to represent the state of the learner with a probability
density over $w$, specifically a Gaussian distribution $\mathcal{N}(\mu, \Sigma)$ with mean $\mu \in \mathbb{R}^d$ and covariance
matrix $\Sigma \in \mathbb{R}^{d \times d}$. The values $\mu_p$ and $\Sigma_{p,p}$ represent knowledge of and confidence in the weight of
feature $p$. The smaller $\Sigma_{p,p}$, the more confidence we have in the mean weight value $\mu_p$. Each covariance
term $\Sigma_{p,p'}$ captures our knowledge of the interaction between features $p$ and $p'$. The Gaussian
distribution naturally matches our intuition for confidence, as the covariance of the distribution is 
inversely proportional to our confidence: the smaller the determinant of the covariance, the less we 
expect the true weight value to deviate from the current estimate. This Gaussian representation is 
illustrated in Figure 2, which shows a Gaussian distribution over two-dimensional weight vectors. 
The black line represents an example x = (0.5,1), y = +1, which divides the space between clas- 
sifiers that correctly classify this point (blue crosses below) and those that classify it incorrectly 
(green dots above). 

In the CW model, the traditional signed margin $y(w \cdot x)$ becomes a univariate Gaussian random
variable $M$, where the mean of the distribution is the signed margin,

$$ M \sim \mathcal{N}\left(y(\mu \cdot x),\; x^\top \Sigma x\right) . \tag{4} $$
Figure 2: Gaussian distribution over two-dimensional weight vectors. Points above the black line
(green dots) incorrectly classify the example ((0.5, 1), +1) and points below the line (blue 
crosses) classify it correctly. The density around a point is proportional to its relative 
weight. The black circle marks the mean of the Gaussian. 


There are several ways to make predictions in this framework. A Gibbs predictor samples from 
the distribution a single weight vector w, which is equivalent to drawing a margin value using (4), 
and takes its sign as the prediction. Other alternatives use averaging rather than sampling. For 
example, we can use the average weight vector $E[w] = \mu$, as is done in Bayes point machines (Herbrich
et al., 2001), which use a single weight vector to approximate a distribution. Alternatively, we
can use the average margin $E[M]$. These two approaches are equivalent by linearity of expectation,
$E[w \cdot x] = \mu \cdot x$. Another approach estimates $E[\operatorname{sign}(M)]$ from many draws of $w$ for fixed $\mu$, $\Sigma$, and
$x$. Since the sign function attains only two values ($-1$ or $+1$) this is equivalent to computing the
probability of a correct prediction (not a large margin prediction), given by

$$ \Pr[M \ge 0] = \Pr_{w \sim \mathcal{N}(\mu,\Sigma)}\left[y(w \cdot x) \ge 0\right] . $$

When possible we omit the explicit dependence on the distribution parameters and simply write
$\Pr[y(w \cdot x) \ge 0]$. If the probability is larger than half, then the (weighted) majority votes for $y = +1$,
otherwise, for $y = -1$. Note that from the discussion below this prediction rule is equivalent to the
previous two. Conceptually, it is useful to think of prediction as drawing a weight vector $w$ from the
distribution, i.e., $w \sim \mathcal{N}(\mu, \Sigma)$, and predicting the label according to the sign of $w \cdot x$. However, as we
said above, the average of many such draws is equivalent to the simple prediction rule $\operatorname{sign}(\mu \cdot x)$,
which we will use in what follows.
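
The prediction rules discussed in this section can be compared on a toy example. The sketch below is an illustration under stated assumptions: the covariance is stored as a list of rows, the helper names (`margin_distribution`, `predict_mean`, `predict_gibbs`, `prob_positive`) are introduced here, and a zero margin is mapped to $+1$ by convention.

```python
import math
import random
from statistics import NormalDist

def margin_distribution(mu, sigma, x):
    """Mean and variance of w.x when w ~ N(mu, sigma); unsigned version of (4)."""
    d = len(x)
    mean = sum(mu[p] * x[p] for p in range(d))
    var = sum(x[p] * sigma[p][q] * x[q] for p in range(d) for q in range(d))
    return mean, var

def predict_mean(mu, sigma, x):
    """Predict with the average weight vector: sign(mu . x)."""
    mean, _ = margin_distribution(mu, sigma, x)
    return 1 if mean >= 0 else -1

def predict_gibbs(mu, sigma, x, rng=random):
    """Gibbs predictor: draw one margin value and take its sign."""
    mean, var = margin_distribution(mu, sigma, x)
    return 1 if rng.gauss(mean, math.sqrt(var)) >= 0 else -1

def prob_positive(mu, sigma, x):
    """Probability that a sampled weight vector votes +1 (requires var > 0)."""
    mean, var = margin_distribution(mu, sigma, x)
    return NormalDist().cdf(mean / math.sqrt(var))

mu, sigma, x = [1.0, 0.5], [[1.0, 0.0], [0.0, 1.0]], [0.5, 1.0]
label = predict_mean(mu, sigma, x)
p_plus = prob_positive(mu, sigma, x)
```

Because the margin distribution is symmetric around its mean, `prob_positive` exceeds one half exactly when the mean margin is positive, so the majority vote and `predict_mean` agree; individual Gibbs draws may of course differ.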


5. Learning Confidence-Weighted Classifiers 


In the previous section we formalized our confidence-weighted learning framework in terms of 
Gaussian distributions over weight vectors. In this section we discuss how to learn such distribu- 
tions. 
CW is an online learning algorithm, so on round $i$ the algorithm receives example $x_i$ for which
it issues a prediction $\hat{y}_i$.³ The algorithm predicts $\hat{y}_i$ as $\operatorname{sign}(\mu_i \cdot x_i)$, which is equivalent to averaging
the predictions of many sampled weight vectors from the distribution. On being presented with the
label $y_i$, the algorithm adjusts the distribution to enforce a learning condition. Following the intuition
underlying the PA algorithms of Crammer et al. (2006a), we require that an update both achieves a
large margin on the example and minimizes the change in weights. In this case, a large prediction
margin is formalized as ensuring that the probability of a correct prediction for training example $i$
is no smaller than the confidence level $\eta \in [0, 1]$:

$$ \Pr\left[y_i(w \cdot x_i) \ge 0\right] \ge \eta . $$

Minimization of weight change is enforced by finding a new distribution closest in the KL divergence⁴
sense to the current distribution $\mathcal{N}(\mu_i, \Sigma_i)$. Thus, on round $i$, the algorithm updates the
distribution by solving the following optimization problem:

$$ (\mu_{i+1}, \Sigma_{i+1}) = \arg\min_{\mu,\Sigma}\; D_{KL}\left(\mathcal{N}(\mu, \Sigma) \,\|\, \mathcal{N}(\mu_i, \Sigma_i)\right) \tag{5} $$
$$ \text{s.t.}\quad \Pr\left[y_i(w \cdot x_i) \ge 0\right] \ge \eta . \tag{6} $$


This update can be understood as a probabilistic counterpart of the PA objective (2). 

We now develop both the objective and the constraint of this optimization problem following 
Boyd and Vandenberghe (2004, page 158). We start with the objective (5) and write the KL diver- 
gence between two Gaussians as 


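$$ D_{KL}\left(\mathcal{N}(\mu_0, \Sigma_0) \,\|\, \mathcal{N}(\mu_1, \Sigma_1)\right) = \frac{1}{2}\left( \log\left(\frac{\det \Sigma_1}{\det \Sigma_0}\right) + \operatorname{Tr}\left(\Sigma_1^{-1}\Sigma_0\right) + (\mu_1 - \mu_0)^\top \Sigma_1^{-1} (\mu_1 - \mu_0) - d \right) . $$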
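
For intuition, the KL divergence above is easy to evaluate when both covariance matrices are diagonal, since the determinant, trace, and quadratic form then decompose over coordinates. The sketch below makes that simplifying assumption (the general formula needs matrix inverses and determinants); the function name `kl_diag_gaussians` is introduced here for illustration.

```python
import math

def kl_diag_gaussians(mu0, var0, mu1, var1):
    """KL( N(mu0, diag(var0)) || N(mu1, diag(var1)) ) for positive variances.
    Diagonal covariances are an assumption made to keep the sketch short."""
    d = len(mu0)
    log_det_ratio = sum(math.log(v1 / v0) for v0, v1 in zip(var0, var1))
    trace = sum(v0 / v1 for v0, v1 in zip(var0, var1))
    quad = sum((m1 - m0) ** 2 / v1 for m0, m1, v1 in zip(mu0, mu1, var1))
    return 0.5 * (log_det_ratio + trace + quad - d)

same = kl_diag_gaussians([0.0, 0.0], [1.0, 2.0], [0.0, 0.0], [1.0, 2.0])
shifted = kl_diag_gaussians([0.0], [1.0], [1.0], [1.0])
```

The divergence is zero for identical distributions and grows both when the mean moves and when the variances change, which is what makes it a natural measure of how far an update departs from the current distribution.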





We now proceed with the constraint in (6). As noted above, under the distribution $\mathcal{N}(\mu, \Sigma)$, the
margin for $(x_i, y_i)$ has a Gaussian distribution with mean

$$ m_i = y_i(\mu_i \cdot x_i) , \tag{7} $$

and variance

$$ \sigma_i^2 = v_i = x_i^\top \Sigma_i x_i . \tag{8} $$


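Thus the probability of a wrong classification is

$$ \Pr[M \le 0] = \Pr\left[\frac{M - m_i}{\sigma_i} \le \frac{-m_i}{\sigma_i}\right] . $$

Since $(M - m_i)/\sigma_i$ is a normally distributed random variable, the above probability equals $\Phi(-m_i/\sigma_i)$,
where

$$ \Phi(u) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{u} e^{-v^2/2}\, dv $$

is the cumulative Gaussian distribution. Thus we can rewrite (6) as

$$ \frac{-m_i}{\sigma_i} \le \Phi^{-1}(1 - \eta) = -\Phi^{-1}(\eta) . $$

Substituting $m_i$ and $\sigma_i$ by their definitions and rearranging terms we obtain

$$ y_i(\mu \cdot x_i) \ge \phi \sqrt{x_i^\top \Sigma x_i} , $$

where $\phi = \Phi^{-1}(\eta)$. To conclude, the update rule solves the following optimization problem:

$$ (\mu_{i+1}, \Sigma_{i+1}) = \arg\min_{\mu,\Sigma}\; \frac{1}{2}\log\left(\frac{\det \Sigma_i}{\det \Sigma}\right) + \frac{1}{2}\operatorname{Tr}\left(\Sigma_i^{-1}\Sigma\right) + \frac{1}{2}(\mu_i - \mu)^\top \Sigma_i^{-1} (\mu_i - \mu) $$
$$ \text{s.t.}\quad y_i(\mu \cdot x_i) \ge \phi \sqrt{x_i^\top \Sigma x_i} . \tag{9} $$

3. For a related batch formulation of CW learning, see recent work of Crammer et al. (2009b).

4. $D_{KL}\left(p(x) \,\|\, q(x)\right) = \int p(x) \log\left(\frac{p(x)}{q(x)}\right) dx$.

CONFIDENCE-WEIGHTED LINEAR CLASSIFICATION FOR TEXT CATEGORIZATION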


Conceptually, this is a large-margin constraint, where the value of the margin requirement depends
on the example $x_i$ via a quadratic form.
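
To make the constraint concrete, the sketch below computes $\phi = \Phi^{-1}(\eta)$ with Python's standard `statistics.NormalDist` and checks whether a given distribution already satisfies the margin requirement for one example. The helper names (`dot`, `quad_form`, `cw_constraint`) and the toy numbers are assumptions made for illustration.

```python
import math
from statistics import NormalDist

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def quad_form(x, sigma):
    """x^T Sigma x for a covariance matrix stored as a list of rows."""
    n = len(x)
    return sum(x[p] * sigma[p][q] * x[q] for p in range(n) for q in range(n))

def cw_constraint(mu, sigma, x, y, eta):
    """Return (is the constraint met?, probability of a correct prediction)."""
    phi = NormalDist().inv_cdf(eta)        # phi = Phi^{-1}(eta); needs 0 < eta < 1
    m = y * dot(mu, x)                     # margin mean
    sd = math.sqrt(quad_form(x, sigma))    # margin standard deviation
    prob_correct = NormalDist().cdf(m / sd)
    return m >= phi * sd, prob_correct

mu, sigma, x, y = [1.0, 0.5], [[1.0, 0.0], [0.0, 1.0]], [0.5, 1.0], +1
ok_80, p = cw_constraint(mu, sigma, x, y, eta=0.80)
ok_90, _ = cw_constraint(mu, sigma, x, y, eta=0.90)
```

The two checks agree by construction: the margin inequality holds exactly when the probability of a correct prediction is at least $\eta$, so raising $\eta$ turns a satisfied constraint into one that triggers an update.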


Unfortunately, this constraint is not convex in $\Sigma$ since the term $\sqrt{x_i^\top \Sigma x_i}$ is concave in $\Sigma$. We
propose two alternatives to obtain a convex constraint: linearization (Section 5.1) and change of
variables (Section 5.2). Additionally, we propose a few alternatives to solve the learning optimization
problem restricted to diagonal matrices in Section 6.


5.1 Linearization of the Constraint 


In our first approach to obtain a convex problem we simply linearize the constraint of (9) by omitting
the square root to obtain the revised optimization problem

$$ (\mu_{i+1}, \Sigma_{i+1}) = \arg\min_{\mu,\Sigma}\; \frac{1}{2}\log\left(\frac{\det \Sigma_i}{\det \Sigma}\right) + \frac{1}{2}\operatorname{Tr}\left(\Sigma_i^{-1}\Sigma\right) + \frac{1}{2}(\mu_i - \mu)^\top \Sigma_i^{-1} (\mu_i - \mu) $$
$$ \text{s.t.}\quad y_i(\mu \cdot x_i) \ge \phi \left(x_i^\top \Sigma x_i\right) . \tag{10} $$

We call this formulation var, since we have replaced the standard deviation in the constraint with
the variance. This formulation was introduced by Dredze et al. (2008). The following lemma
summarizes the solution of this formulation.


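Lemma 1 The optimal solution of this form is

$$ \mu_{i+1} = \mu_i + \alpha y_i \Sigma_i x_i , \qquad \Sigma_{i+1}^{-1} = \Sigma_i^{-1} + 2\alpha\phi\, x_i x_i^\top , $$

where the value of the parameter $\alpha$ (a Lagrange multiplier) is given by

$$ \alpha_i = \max\left\{ 0,\; \frac{-(1 + 2\phi m_i) + \sqrt{(1 + 2\phi m_i)^2 - 8\phi\,(m_i - \phi v_i)}}{4\phi v_i} \right\} , $$

where $m_i = y_i(\mu_i \cdot x_i)$ (see (7)) and $v_i = x_i^\top \Sigma_i x_i$ (see (8)).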


The derivation appears in Section 5.1.1 below. The resulting algorithm is shown in Algorithm 1,
where the update uses (11) and (13) to update the distribution with coefficients $\beta_i$ (15) and $\alpha_i$
($\max\{(18), 0\}$).
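Algorithm 1 Binary CW Online Algorithm. The two versions of the Confidence-Weighted algorithm:
(1) linearization and (2) change of variables. The numbers in parentheses refer to equations
in the text, where more detail can be found.

Input: $\eta \in (0.5, 1]$

Initialize: $\mu_1 = 0$ ; $\Sigma_1 = I$ ; $\phi = \Phi^{-1}(\eta)$, $\phi' = 1 + \phi^2/2$, $\phi'' = 1 + \phi^2$.

for $i = 1, 2, \ldots$ do
    Receive a training example $x_i \in \mathbb{R}^d$
    Compute Gaussian margin distribution $M_i \sim \mathcal{N}\left(\mu_i \cdot x_i,\; x_i^\top \Sigma_i x_i\right)$
    Receive true label $y_i$
    Suffer loss $\ell_i = 1$ iff $y_i E[\operatorname{sign}(M_i)] \le 0$
    Compute update:

    • Define: $m_i = y_i(\mu_i \cdot x_i)$ (7), $v_i = x_i^\top \Sigma_i x_i$ (8)

    • Linearization:

      $\alpha_i = \max\left\{ 0,\; \frac{-(1 + 2\phi m_i) + \sqrt{(1 + 2\phi m_i)^2 - 8\phi\,(m_i - \phi v_i)}}{4\phi v_i} \right\}$ (18)

      $\beta_i = \frac{2\alpha_i\phi}{1 + 2\alpha_i\phi v_i}$ (15)

    • Change of variables:

      $\alpha_i = \max\left\{ 0,\; \frac{1}{v_i\phi''}\left( -m_i\phi' + \sqrt{m_i^2\frac{\phi^4}{4} + v_i\phi^2\phi''} \right) \right\}$ (31)

      $\sqrt{v_i^+} = \frac{-\alpha_i v_i\phi + \sqrt{\alpha_i^2 v_i^2\phi^2 + 4v_i}}{2}$ (28)

      $\beta_i = \frac{\alpha_i\phi}{\sqrt{v_i^+} + v_i\alpha_i\phi}$ (27)

    Update:

      $\mu_{i+1} = \mu_i + \alpha_i y_i \Sigma_i x_i$ (11, 20)
      $\Sigma_{i+1} = \Sigma_i - \beta_i \Sigma_i x_i x_i^\top \Sigma_i$ (14, 25)

end for
Output: Final $\mu$ and $\Sigma$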
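5.1.1 DERIVATION OF LEMMA 1


The optimization objective is convex in $\mu$ and $\Sigma$ simultaneously and the constraint is now linear,
so any convex optimization solver can be used to solve this problem. The Lagrangian for this
optimization is

$$ \mathcal{L} = \frac{1}{2}\log\left(\frac{\det \Sigma_i}{\det \Sigma}\right) + \frac{1}{2}\operatorname{Tr}\left(\Sigma_i^{-1}\Sigma\right) + \frac{1}{2}(\mu_i - \mu)^\top \Sigma_i^{-1} (\mu_i - \mu) + \alpha\left( -y_i(\mu \cdot x_i) + \phi\left(x_i^\top \Sigma x_i\right) \right) . $$

Taking partial derivatives, we know that at the optimum, we must have

$$ \frac{\partial}{\partial \mu}\mathcal{L} = \Sigma_i^{-1}(\mu - \mu_i) - \alpha y_i x_i = 0 . $$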


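Assuming $\Sigma_i$ is non-singular and rearranging terms we get

$$ \mu_{i+1} = \mu_i + \alpha y_i \Sigma_i x_i . \tag{11} $$

At the optimum, we must also have

$$ \frac{\partial}{\partial \Sigma}\mathcal{L} = -\frac{1}{2}\Sigma^{-1} + \frac{1}{2}\Sigma_i^{-1} + \alpha\phi\, x_i x_i^\top = 0 , \tag{12} $$

and solving for $\Sigma^{-1}$ we obtain

$$ \Sigma_{i+1}^{-1} = \Sigma_i^{-1} + 2\alpha\phi\, x_i x_i^\top . \tag{13} $$

Before proceeding, we observe that (13) computes $\Sigma_{i+1}^{-1}$ as the sum of a rank-one positive
semi-definite (PSD) matrix and $\Sigma_i^{-1}$. Thus, if $\Sigma_i^{-1}$ is PSD, so are $\Sigma_{i+1}^{-1}$ and $\Sigma_{i+1}$; thus $\Sigma_i$ is indeed
non-singular, as assumed above. The update guarantees that the eigenvalues of the inverse-covariance
matrix never decrease.

Finally, we compute the inverse of (13) using the Woodbury identity (Petersen and Pedersen,
2008, Equation 135) and get

$$ \begin{aligned}
\Sigma_{i+1} &= \left(\Sigma_i^{-1} + 2\alpha\phi\, x_i x_i^\top\right)^{-1} \\
&= \Sigma_i - \Sigma_i x_i \left(\frac{1}{2\alpha\phi} + x_i^\top \Sigma_i x_i\right)^{-1} x_i^\top \Sigma_i \\
&= \Sigma_i - \Sigma_i x_i \frac{2\alpha\phi}{1 + 2\alpha\phi\, x_i^\top \Sigma_i x_i}\, x_i^\top \Sigma_i \\
&= \Sigma_i - \Sigma_i x_i \beta_i x_i^\top \Sigma_i ,
\end{aligned} \tag{14} $$

where

$$ \beta_i = \frac{2\alpha\phi}{1 + 2\alpha\phi v_i} , \qquad v_i = x_i^\top \Sigma_i x_i . \tag{15} $$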
The KKT conditions for the optimization imply that either $\alpha = 0$, and no update is needed, or
the constraint in (10) is an equality after the update. Substituting (11) and (14) into the equality
version of (10), we obtain

$$ y_i\left(x_i \cdot (\mu_i + \alpha y_i \Sigma_i x_i)\right) = \phi\left(x_i^\top \left(\Sigma_i - \Sigma_i x_i \beta_i x_i^\top \Sigma_i\right) x_i\right) . \tag{16} $$

Rearranging terms we get

$$ y_i(x_i \cdot \mu_i) + \alpha\, x_i^\top \Sigma_i x_i = \phi\, x_i^\top \Sigma_i x_i - \phi v_i^2 \beta_i . \tag{17} $$

Substituting (7), (8), and (15) into (17) we obtain

$$ m_i + \alpha v_i = \phi v_i - \phi v_i^2 \frac{2\alpha\phi}{1 + 2\alpha\phi v_i} . $$

We multiply both sides by $1 + 2\alpha\phi v_i$ and get

$$ (m_i + \alpha v_i)(1 + 2\alpha\phi v_i) = \phi v_i (1 + 2\alpha\phi v_i) - 2\alpha\phi^2 v_i^2 . $$

Rearranging the terms we obtain

$$ 0 = m_i + \alpha v_i + 2\alpha\phi v_i m_i + 2\alpha^2\phi v_i^2 - \phi v_i = \alpha^2\left(2\phi v_i^2\right) + \alpha v_i (1 + 2\phi m_i) + (m_i - \phi v_i) . $$

The above equality is a quadratic equation in $\alpha$. Its smaller root is always negative and thus is not a
valid Lagrange multiplier. Let $\gamma_i$ be its larger root:

$$ \gamma_i = \frac{-(1 + 2\phi m_i) + \sqrt{(1 + 2\phi m_i)^2 - 8\phi\,(m_i - \phi v_i)}}{4\phi v_i} . \tag{18} $$

The constraint (10) is satisfied before the update if $m_i - \phi v_i \ge 0$. If $1 + 2\phi m_i \le 0$, then $m_i \le \phi v_i$ and
from (18) we have that $\gamma_i > 0$. If, instead, $1 + 2\phi m_i \ge 0$, then, again by (18), we have

$$ \gamma_i > 0 \;\iff\; \sqrt{(1 + 2\phi m_i)^2 - 8\phi\,(m_i - \phi v_i)} > (1 + 2\phi m_i) \;\iff\; m_i < \phi v_i . $$

From the KKT conditions, either $\alpha_i = 0$ or (10) is satisfied as an equality. In the latter case, (16)
holds, and thus $\alpha_i = \gamma_i > 0$, which concludes the derivation of the lemma.

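
The closed form of Lemma 1 can be checked numerically. The following is a minimal pure-Python sketch of one var update with a full covariance stored as a list of rows; the function name `cw_var_update` and the toy inputs are assumptions made for illustration, and the code relies on $\Sigma_i$ being symmetric so that $\Sigma_i x_i x_i^\top \Sigma_i$ is the outer product of $\Sigma_i x_i$ with itself.

```python
import math
from statistics import NormalDist

def cw_var_update(mu, sigma, x, y, phi):
    """One linearized (var) CW update following (11), (14), (15), and (18)."""
    d = len(x)
    sx = [sum(sigma[p][q] * x[q] for q in range(d)) for p in range(d)]  # Sigma_i x_i
    m = y * sum(mu[p] * x[p] for p in range(d))    # margin mean (7)
    v = sum(x[p] * sx[p] for p in range(d))        # margin variance (8)
    if v == 0.0:
        return list(mu), [row[:] for row in sigma]
    b = 1.0 + 2.0 * phi * m
    disc = b * b - 8.0 * phi * (m - phi * v)       # equals (1 - 2*phi*m)^2 + 8*phi^2*v
    alpha = max(0.0, (-b + math.sqrt(disc)) / (4.0 * phi * v))      # (18)
    beta = 2.0 * alpha * phi / (1.0 + 2.0 * alpha * phi * v)        # (15)
    new_mu = [mu[p] + alpha * y * sx[p] for p in range(d)]          # (11)
    new_sigma = [[sigma[p][q] - beta * sx[p] * sx[q] for q in range(d)]
                 for p in range(d)]                                 # (14)
    return new_mu, new_sigma

phi = NormalDist().inv_cdf(0.9)
x, y = [0.5, 1.0], +1
mu1, sigma1 = cw_var_update([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], x, y, phi)
m_new = y * sum(mu1[p] * x[p] for p in range(2))
v_new = sum(x[p] * sigma1[p][q] * x[q] for p in range(2) for q in range(2))
```

After an update with $\alpha_i > 0$, the new margin mean equals $\phi$ times the new margin variance, which is the equality version of the constraint in (10); a second update on the same example leaves the distribution essentially unchanged.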
5.2 Change of Variables 


While linearization yielded a closed form convex solution to our optimization, it required approxi- 
mating the constraint. We now proceed with the second alternative of obtaining a convex optimiza- 
tion problem by a change of variables, which allows us to achieve an exact convex update. 
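Since $\Sigma$ is positive-semidefinite (PSD) it can be written as the square of another PSD matrix⁵ $\Upsilon$:

$$ \Sigma = \Upsilon^2 , \qquad \Upsilon = \Sigma^{1/2} . $$

Substituting in (9) gives the revised optimization problem

$$ (\mu_{i+1}, \Upsilon_{i+1}) = \arg\min_{\mu,\Upsilon}\; \frac{1}{2}\log\left(\frac{\det \Upsilon_i^2}{\det \Upsilon^2}\right) + \frac{1}{2}\operatorname{Tr}\left(\Upsilon_i^{-2}\Upsilon^2\right) + \frac{1}{2}(\mu_i - \mu)^\top \Upsilon_i^{-2} (\mu_i - \mu) $$
$$ \text{s.t.}\quad y_i(\mu \cdot x_i) \ge \phi\,\|\Upsilon x_i\| , \qquad \Upsilon \text{ is PSD} . \tag{19} $$

Note that the objective is convex since $-\log\det\Upsilon^2 = -2\log\det\Upsilon$, which is well defined since $\Upsilon$ is
PSD. The constraint is a second-order cone inequality and therefore convex.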

We call this formulation stdev, since we have maintained the standard deviation in the con- 
straint. This formulation was introduced by Crammer et al. (2008). 

Standard optimization techniques can solve the convex program (19), but these methods can be 
slow. Instead, as before we derive a closed-form solution which we summarize in the following 
lemma: 


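Lemma 2 The optimal solution of this form is

$$ \mu_{i+1} = \mu_i + \alpha y_i \Sigma_i x_i , \qquad \Sigma_{i+1} = \Sigma_i - \beta_i \Sigma_i x_i x_i^\top \Sigma_i , $$

where

$$ \beta_i = \frac{\alpha\phi}{\sqrt{v_i^+} + v_i\alpha\phi} , \qquad v_i^+ = x_i^\top \Sigma_{i+1} x_i , $$

and the value of the parameter $\alpha$ (a Lagrange multiplier) is given by

$$ \alpha = \max\left\{ 0,\; \frac{1}{v_i\phi''}\left( -m_i\phi' + \sqrt{m_i^2\frac{\phi^4}{4} + v_i\phi^2\phi''} \right) \right\} , $$

where $m_i = y_i(\mu_i \cdot x_i)$ (see (7)), $v_i = x_i^\top \Sigma_i x_i$ (see (8)), and for simplicity we define $\phi' = 1 + \phi^2/2$, $\phi'' =
1 + \phi^2$.


The resulting algorithm is shown in Algorithm 1.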


5.2.1 DERIVATION OF LEMMA 2 


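The Lagrangian for (19) is

$$ \mathcal{L} = \frac{1}{2}\log\left(\frac{\det \Upsilon_i^2}{\det \Upsilon^2}\right) + \frac{1}{2}\operatorname{Tr}\left(\Upsilon_i^{-2}\Upsilon^2\right) + \frac{1}{2}(\mu_i - \mu)^\top \Upsilon_i^{-2} (\mu_i - \mu) + \alpha\left( -y_i(\mu \cdot x_i) + \phi\,\|\Upsilon x_i\| \right) . $$

5. We use a decomposition in terms of PSD matrices because it yields a convex optimization problem. In general, a
PSD matrix $\Sigma$ can be written as $\Sigma = AA^\top$, which is not convex because it is rotation-invariant. Alternatively, any
symmetric matrix $S$ can be used, $\Sigma = S^2$, but this is not convex either, since it is invariant to reflections.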
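At the optimum, it must be that

$$ \frac{\partial}{\partial \mu}\mathcal{L} = \Upsilon_i^{-2}(\mu - \mu_i) - \alpha y_i x_i = 0 . $$

Therefore, if $\Upsilon_i$ is non-singular, the update for the mean is

$$ \mu_{i+1} = \mu_i + \alpha y_i \Upsilon_i^2 x_i . \tag{20} $$

At the optimum, we must also have

$$ \frac{\partial}{\partial \Upsilon}\mathcal{L} = -\Upsilon^{-1} + \frac{1}{2}\Upsilon_i^{-2}\Upsilon + \frac{1}{2}\Upsilon\Upsilon_i^{-2} + \alpha\phi \frac{x_i x_i^\top \Upsilon}{2\sqrt{x_i^\top \Upsilon^2 x_i}} + \alpha\phi \frac{\Upsilon x_i x_i^\top}{2\sqrt{x_i^\top \Upsilon^2 x_i}} = 0 . \tag{21} $$

Defining the matrix

$$ C = \Upsilon_i^{-2} + \alpha\phi \frac{x_i x_i^\top}{\sqrt{x_i^\top \Upsilon^2 x_i}} , \tag{22} $$

we get

$$ \frac{\partial}{\partial \Upsilon}\mathcal{L} = -\Upsilon^{-1} + \frac{1}{2}\Upsilon C + \frac{1}{2}C\Upsilon = 0 $$

at the optimum. From this, it follows easily that at the optimum

$$ \Upsilon = C^{-1/2} . $$

Substituting (22) into this equation, we obtain the update

$$ \Upsilon_{i+1}^{-2} = \Upsilon_i^{-2} + \alpha\phi \frac{x_i x_i^\top}{\sqrt{x_i^\top \Upsilon_{i+1}^2 x_i}} . $$

Conveniently, the final form of the updates can be expressed in terms of the covariance matrix:⁶

$$ \mu_{i+1} = \mu_i + \alpha y_i \Sigma_i x_i , \tag{23} $$

$$ \Sigma_{i+1}^{-1} = \Sigma_i^{-1} + \alpha\phi \frac{x_i x_i^\top}{\sqrt{x_i^\top \Sigma_{i+1} x_i}} . \tag{24} $$

As before we observe that if $\Sigma_i^{-1}$ is PSD, so are $\Sigma_{i+1}^{-1}$ and $\Sigma_{i+1}$ with monotonically decreasing
eigenvalues. Thus $\Sigma_i$ is indeed non-singular, as assumed above.

6. Furthermore, writing the Lagrangian of (9) and solving it would yield the same solution as Equations (23,24). Thus
the optimal solutions of (9) and (19) are the same.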




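It remains to determine the value of the Lagrange multiplier $\alpha$. As before we compute the
inverse of (24) using the Woodbury identity (Petersen and Pedersen, 2008) to get

$$\Sigma_{i+1} = \left(\Sigma_i^{-1} + \alpha\phi\frac{x_i x_i^\top}{\sqrt{x_i^\top\Sigma_{i+1}x_i}}\right)^{-1}$$

$$= \Sigma_i - \Sigma_i x_i\left(\frac{\sqrt{x_i^\top\Sigma_{i+1}x_i}}{\alpha\phi} + x_i^\top\Sigma_i x_i\right)^{-1} x_i^\top\Sigma_i$$

$$= \Sigma_i - \Sigma_i x_i\,\frac{\alpha\phi}{\sqrt{x_i^\top\Sigma_{i+1}x_i} + x_i^\top\Sigma_i x_i\,\alpha\phi}\,x_i^\top\Sigma_i$$

$$= \Sigma_i - \beta_i\,\Sigma_i x_i x_i^\top\Sigma_i, \qquad (25)$$

where we define (recall that $v_i = x_i^\top\Sigma_i x_i$)

$$v_i^+ = x_i^\top\Sigma_{i+1}x_i \qquad (26)$$

and

$$\beta_i = \frac{\alpha\phi}{\sqrt{v_i^+} + v_i\alpha\phi}. \qquad (27)$$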


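Multiplying (25) by $x_i^\top$ (left) and $x_i$ (right) we get

$$v_i^+ = v_i - v_i\,\frac{\alpha\phi}{\sqrt{v_i^+} + v_i\alpha\phi}\,v_i,$$

which is equivalent to

$$v_i^+\sqrt{v_i^+} + v_i^+ v_i\alpha\phi = v_i\sqrt{v_i^+} + v_i^2\alpha\phi - v_i^2\alpha\phi = v_i\sqrt{v_i^+}.$$

Dividing both sides by $\sqrt{v_i^+}$, we obtain

$$v_i^+ + \sqrt{v_i^+}\,v_i\alpha\phi - v_i = 0,$$

which can be solved for $\sqrt{v_i^+}$ to obtain

$$\sqrt{v_i^+} = \frac{-\alpha v_i\phi + \sqrt{\alpha^2 v_i^2\phi^2 + 4v_i}}{2}. \qquad (28)$$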





The KKT conditions for the optimization imply that either $\alpha = 0$ and no update is needed, or the
constraint (19) is an equality after the update.
Using the equality version of (19) and Equations (23,25,26,28) we obtain

$$m_i + \alpha v_i = \phi\,\frac{-\alpha v_i\phi + \sqrt{\alpha^2 v_i^2\phi^2 + 4v_i}}{2}, \qquad (29)$$

which can be rearranged into the following quadratic equation in $\alpha$:

$$\alpha^2 v_i^2\left(1+\phi^2\right) + 2\alpha m_i v_i\left(1+\frac{\phi^2}{2}\right) + \left(m_i^2 - v_i\phi^2\right) = 0.$$

The smaller root of this equation is always negative and thus not a valid Lagrange multiplier. We
use the following abbreviations for writing the larger root $\gamma_i$:

$$\phi' = 1 + \phi^2/2; \qquad \phi'' = 1 + \phi^2.$$

The larger root is then

$$\gamma_i = \frac{-m_i v_i\phi' + \sqrt{m_i^2 v_i^2\phi'^2 - v_i^2\phi''\left(m_i^2 - v_i\phi^2\right)}}{v_i^2\phi''}. \qquad (30)$$


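The constraint (19) is satisfied before the update if $m_i - \phi\sqrt{v_i} \ge 0$. If $m_i \le 0$, then $m_i \le \phi\sqrt{v_i}$ and
from (30) we have that $\gamma_i > 0$. If instead $m_i \ge 0$, then, again by (30), we have

$$\gamma_i > 0$$

$$\Leftrightarrow\; m_i v_i\phi' < \sqrt{m_i^2 v_i^2\phi'^2 - v_i^2\phi''\left(m_i^2 - v_i\phi^2\right)}$$

$$\Leftrightarrow\; m_i < \phi\sqrt{v_i}.$$

From the KKT conditions, either $\alpha_i = 0$ or (10) is satisfied as an equality, in which case (29) holds and
$\alpha_i = \gamma_i > 0$.

The solution of (30) satisfies the KKT conditions, that is either $\alpha_i \ge 0$ or the constraint of (10) is
satisfied before the update with the weights $\mu_i$ and $\Sigma_i$. We obtain the final form of $\alpha_i$ by simplifying
(30) together with the last comment and get

$$\alpha_i = \max\left\{0,\; \frac{1}{v_i\phi''}\left(-m_i\phi' + \sqrt{m_i^2\frac{\phi^4}{4} + v_i\phi^2\phi''}\right)\right\}. \qquad (31)$$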
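The closed form above can be checked numerically. The following is a minimal sketch, not the authors' implementation: the function name `stdev_step` and its scalar interface are assumptions made for illustration. It evaluates (31), then (28) and (27), for given values of $m_i$, $v_i$ and $\phi$.

```python
import math


def stdev_step(m, v, phi):
    """Return (alpha, v_plus, beta) for one change-of-variables update.

    m   : signed margin y_i * (mu_i . x_i)
    v   : variance x_i^T Sigma_i x_i, assumed > 0
    phi : confidence parameter
    The function name and interface are hypothetical.
    """
    phi1 = 1.0 + phi * phi / 2.0   # phi'  in the text
    phi2 = 1.0 + phi * phi         # phi'' in the text
    # Equation (31): closed-form Lagrange multiplier, clipped at zero.
    root = math.sqrt(m * m * phi ** 4 / 4.0 + v * phi * phi * phi2)
    alpha = max(0.0, (-m * phi1 + root) / (v * phi2))
    # Equation (28): square root of the updated variance v_i^+.
    sqrt_v_plus = (-alpha * v * phi
                   + math.sqrt((alpha * v * phi) ** 2 + 4.0 * v)) / 2.0
    # Equation (27): coefficient of the rank-one covariance correction.
    beta = alpha * phi / (sqrt_v_plus + v * alpha * phi)
    return alpha, sqrt_v_plus ** 2, beta


alpha, v_plus, beta = stdev_step(m=0.2, v=1.5, phi=1.0)
# When alpha > 0 the constraint holds with equality after the update (29).
print(abs((0.2 + alpha * 1.5) - 1.0 * math.sqrt(v_plus)) < 1e-9)
```

With these three scalars the mean update (23) and the covariance update (25) require only matrix-vector products with $\Sigma_i$.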








6. Diagonal Covariance Matrices 


So far we have said nothing about the covariance matrix $\Sigma$, which grows quadratically in the number
of features. Since our intended applications are NLP tasks, computing the full matrix $\Sigma$ is computationally
infeasible. Additionally, even though we initialize the matrix to be diagonal (Figure 1), after
applying the update rule of either (14) (linearization/var) or (25) (change of variables/stdev), we
may obtain a full covariance matrix, as we subtract from $\Sigma_i$ a rank-one matrix proportional to the
outer product of $\Sigma_i x_i$ with itself. Therefore, successful applications to NLP problems require a restriction on the
size of the matrix $\Sigma$.
In this section, we reduce the size of $\Sigma$ by restricting it to a diagonal covariance matrix.⁷ We
discuss two main approaches, each of which can be applied to either the linearization or the change
of variables formulations. In Section 6.1 we show two ways to use the full-covariance updates
discussed above and add a diagonalization step. In Section 6.2 we take an alternative approach and
re-develop the update step assuming an explicit diagonal representation of the covariance matrix.


6.1 Approximate Diagonal Update 


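Both updates above (linearization or change of variables) share the same form when updating the
covariance matrix ((14) or (25)):

$$\Sigma_{i+1} = \Sigma_i - \beta_i\,\Sigma_i x_i x_i^\top\Sigma_i. \qquad (32)$$

Our diagonalization step defines the final matrix to be a diagonal matrix whose non-zero elements
equal the diagonal elements of (32). Formally we get

$$\Sigma_{i+1} = \mathrm{diag}\left(\Sigma_i - \beta_i\,\Sigma_i x_i x_i^\top\Sigma_i\right)$$

$$= \mathrm{diag}\left(\Sigma_i\right) - \mathrm{diag}\left(\beta_i\,\Sigma_i x_i x_i^\top\Sigma_i\right)$$

$$= \Sigma_i - \beta_i\,\mathrm{diag}\left(\Sigma_i x_i x_i^\top\Sigma_i\right),$$

where the last equality follows since we assume that $\Sigma_i$ is diagonal and

$$\mathrm{diag}(A)_{p,q} = \begin{cases} A_{p,p} & p = q \\ 0 & \text{otherwise.} \end{cases}$$

A naive implementation of the diagonal operator takes $\Theta(d^2)$ time and space. An efficient
implementation first defines $z_i = \Sigma_i x_i$ and then sets

$$\left(\Sigma_{i+1}\right)_{p,p} = \left(\Sigma_i\right)_{p,p} - \beta_i\left[(z_i)_p\right]^2 \quad \text{for } p = 1,\ldots,d.$$

We refer to this diagonalization scheme as L2 since it is equivalent to a projection of the full matrix
onto the set of diagonal matrices using the Euclidean norm.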
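The efficient L2 update can be sketched in a few lines. This is an illustrative sketch only: the function name `l2_diag_update` and the dense-list representation of the diagonal are assumptions, and a practical implementation for sparse text features would touch only the non-zero entries of $x_i$.

```python
def l2_diag_update(sigma_diag, x, beta):
    """Diagonal of Sigma_i - beta * Sigma_i x x^T Sigma_i for a diagonal Sigma_i.

    sigma_diag : list holding the diagonal of Sigma_i
    x          : feature vector x_i as a list of the same length
    beta       : the scalar beta_i of the update
    """
    # z = Sigma_i x costs O(d) because Sigma_i is diagonal.
    z = [s * xp for s, xp in zip(sigma_diag, x)]
    # Only the diagonal of the rank-one term is needed: beta * z_p ** 2.
    return [s - beta * zp * zp for s, zp in zip(sigma_diag, z)]


new_diag = l2_diag_update([1.0, 0.5, 2.0], [1.0, 0.0, 2.0], beta=0.1)
print(new_diag)
```

Entries whose feature value is zero are left unchanged, so the cost per example is proportional to the number of active features rather than to $d^2$.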

We note in passing that since the diagonalization operator and the inverse operator do not commute,
we can instead first diagonalize the inverse of the covariance matrix and then invert the result.
Concretely we start from the update of the inverse covariance,

$$\Sigma_{i+1}^{-1} = \Sigma_i^{-1} + \eta_i\,x_i x_i^\top,$$

where

$$\eta_i = 2\alpha_i\phi$$

for the linearization approach ((13)) and

$$\eta_i = \frac{\alpha_i\phi}{\sqrt{x_i^\top\Sigma_{i+1}x_i}}$$


7. There are other possible choices for reducing the matrix size, such as enforcing a sparse block diagonal matrix. We 
select diagonalization since it is the most straightforward reduction and yields a first order model. See a recent paper 
by Ma et al. (2010) for low rank options. 
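for the change of variables approach ((24)). We first diagonalize the inverse covariance and get

$$\Sigma_{i+1}^{-1} = \mathrm{diag}\left(\Sigma_i^{-1} + \eta_i\,x_i x_i^\top\right)$$

$$= \mathrm{diag}\left(\Sigma_i^{-1}\right) + \mathrm{diag}\left(\eta_i\,x_i x_i^\top\right)$$

$$= \Sigma_i^{-1} + \eta_i\,\mathrm{diag}\left(x_i x_i^\top\right).$$

As before we implement the update efficiently by writing

$$\left(\Sigma_{i+1}^{-1}\right)_{p,p} = \left(\Sigma_i^{-1}\right)_{p,p} + \eta_i\,x_{i,p}^2, \quad \text{for } p = 1,\ldots,d,$$

or in terms of the covariance matrix

$$\left(\Sigma_{i+1}\right)_{p,p} = \frac{1}{1/\left(\Sigma_i\right)_{p,p} + \eta_i\,x_{i,p}^2}, \quad \text{for } p = 1,\ldots,d.$$

We refer to this diagonalization scheme as KL since it is equivalent to a projection of the full matrix
onto the set of diagonal matrices using the Kullback-Leibler (KL) divergence.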
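The KL scheme admits an equally short sketch. The function name `kl_diag_update` and the list representation are assumptions for illustration; `eta` stands for the scalar $\eta_i$ of whichever formulation is in use.

```python
def kl_diag_update(sigma_diag, x, eta):
    """Per-coordinate update 1 / (1/Sigma_pp + eta * x_p^2).

    sigma_diag : list holding the (strictly positive) diagonal of Sigma_i
    x          : feature vector x_i as a list of the same length
    eta        : the non-negative scalar eta_i
    """
    return [1.0 / (1.0 / s + eta * xp * xp) for s, xp in zip(sigma_diag, x)]


print(kl_diag_update([1.0, 0.5], [1.0, 0.0], eta=1.0))
```

Because a non-negative quantity is added to each inverse variance, every updated variance stays positive and never increases, and coordinates with a zero feature value are unchanged.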





6.2 Exact Diagonal Update 


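An alternative to the approximate formulation is to explicitly maintain a diagonal matrix and develop a
corresponding update. We now assume that the matrix $\Sigma$ is diagonal. We denote by $\Sigma_{i,(p)}$ the $p$th
diagonal element of the matrix $\Sigma_i$, and by $x_{i,(p)}$ the $p$th element of $x_i$. We start with the first alternative
above where we used linearization. We follow the derivation of Section 5.1 until (11). Proceeding
with the derivation of (12), but only for the diagonal elements indexed by $p$, we get

$$\frac{\partial}{\partial\Sigma_{(p)}}\mathcal{L} = -\frac{1}{2\Sigma_{(p)}} + \frac{1}{2\Sigma_{i,(p)}} + \alpha\phi\,x_{i,(p)}^2 = 0 \quad \text{for } p = 1,\ldots,d.$$

Solving for $\Sigma_{(p)}$ we get

$$\Sigma_{i+1,(p)} = \frac{\Sigma_{i,(p)}}{1 + 2\alpha\phi\,\Sigma_{i,(p)}\,x_{i,(p)}^2}.$$

Following the logic presented after (15) we get that at the optimum we have

$$y_i\left(x_i\cdot\left(\mu_i + \alpha y_i\Sigma_i x_i\right)\right) = \phi\sum_p \frac{x_{i,(p)}^2\,\Sigma_{i,(p)}}{1 + 2\alpha\phi\,\Sigma_{i,(p)}\,x_{i,(p)}^2}.$$

Substituting (7) and (8) and rearranging the terms we get the constraint $f(\alpha) = 0$, where we defined

$$f(\alpha) = m_i + \alpha v_i - \phi\sum_p \frac{\Sigma_{i,(p)}\,x_{i,(p)}^2}{1 + 2\alpha\phi\,\Sigma_{i,(p)}\,x_{i,(p)}^2}. \qquad (33)$$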
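We will analyze (33) after developing its equivalent for the second alternative above where we
perform a change of variables. As above we denote by $\Upsilon_{i,(p)}$ the $p$th diagonal element of the matrix
$\Upsilon_i$. We follow the derivation of Section 5.2 until (20). Proceeding with the derivation of (21), but only
for the diagonal elements indexed by $p$, we get

$$\frac{\partial}{\partial\Upsilon_{(p)}}\mathcal{L} = -\frac{1}{\Upsilon_{(p)}} + \frac{\Upsilon_{(p)}}{\Upsilon_{i,(p)}^2} + \alpha\phi\,\frac{x_{i,(p)}^2\,\Upsilon_{(p)}}{\sqrt{x_i^\top\Upsilon^2 x_i}} = 0.$$

Thus,

$$\Upsilon_{(p)}^2 = \frac{\Upsilon_{i,(p)}^2\,\sqrt{x_i^\top\Upsilon^2 x_i}}{\sqrt{x_i^\top\Upsilon^2 x_i} + \alpha\phi\,x_{i,(p)}^2\,\Upsilon_{i,(p)}^2}.$$

Multiplying both sides by $x_{i,(p)}^2$ and summing over $p$ we get

$$x_i^\top\Upsilon^2 x_i = \sqrt{x_i^\top\Upsilon^2 x_i}\,\sum_p \frac{x_{i,(p)}^2\,\Upsilon_{i,(p)}^2}{\sqrt{x_i^\top\Upsilon^2 x_i} + \alpha\phi\,x_{i,(p)}^2\,\Upsilon_{i,(p)}^2}.$$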


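As before we employ the KKT conditions, which state that if $\alpha > 0$ then we have

$$m_i + \alpha v_i = \phi\sqrt{x_i^\top\Upsilon^2 x_i}.$$

Substituting in the last equality we get

$$m_i + \alpha v_i = \phi^2\sum_p \frac{x_{i,(p)}^2\,\Upsilon_{i,(p)}^2}{m_i + \alpha v_i + \alpha\phi^2\,x_{i,(p)}^2\,\Upsilon_{i,(p)}^2}.$$

We use again the KKT conditions and get that the optimal value $\alpha_i$ is the solution of $g(\alpha) = 0$ for

$$g(\alpha) = m_i + \alpha v_i - \phi^2\sum_p \frac{x_{i,(p)}^2\,\Upsilon_{i,(p)}^2}{m_i + \alpha v_i + \alpha\phi^2\,x_{i,(p)}^2\,\Upsilon_{i,(p)}^2}. \qquad (34)$$

The function $g(\alpha)$ defined in (34) and the function $f(\alpha)$ defined in (33) are both of the form

$$h(\alpha) = m_i + \alpha v_i - \sum_r \frac{a_r}{b + \alpha c_r},$$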





where $v_i, a_r, c_r \ge 0$. The only difference is that $b = 1 > 0$ in (33) and $b = m_i$ in (34). Nevertheless,
the optimal value of $\alpha$ satisfies $h(\alpha_i) = 0$. The following lemma summarizes a few properties of both
functions:
Lemma 3 Assume that $v_i > 0$ and let $L_i = \max\{0, -m_i/v_i\}$. Both (33) and (34) have the following
properties:

1. Their value at $L_i$ is non-positive, that is $f(L_i) \le 0$, $g(L_i) \le 0$.
2. They are strictly increasing for $\alpha \ge L_i$.
3. For each function there exists a value $U_i$ such that their value at $U_i$ is positive, $f(U_i) > 0$, $g(U_i) > 0$.


Proof For the first property we consider two cases, $m_i \ge 0$ and $m_i < 0$. We start with the first case,
in which $L_i = 0$. Thus, $f(0) = m_i - \phi\sum_p \Sigma_{i,(p)}\,x_{i,(p)}^2 = m_i - \phi v_i < 0$, where the last inequality follows
since we assume that the constraint of (10) does not hold. Also, $g(0) = m_i - \frac{\phi^2}{m_i}\sum_p x_{i,(p)}^2\,\Upsilon_{i,(p)}^2 = m_i - \frac{\phi^2 v_i}{m_i} < 0$
since we assumed that the constraint of (19) does not hold. When $m_i < 0$ we have
$L_i = -m_i/v_i > 0$. In this case (33) becomes

$$f(L_i) = -\sum_p \frac{\Sigma_{i,(p)}\,\phi\,x_{i,(p)}^2}{1 + 2L_i\,\Sigma_{i,(p)}\,\phi\,x_{i,(p)}^2} \le 0$$

since $\Sigma_{i,(p)}\,\phi\,x_{i,(p)}^2 \ge 0$. Similarly,

$$g(L_i) = -\sum_p \frac{\phi^2\,x_{i,(p)}^2\,\Upsilon_{i,(p)}^2}{L_i\,\phi^2\,x_{i,(p)}^2\,\Upsilon_{i,(p)}^2} < 0.$$

The second property, that the functions are strictly increasing, follows immediately since $v_i > 0$ and since for both
functions the denominator of each term in the sum over $p$ is an increasing function of $\alpha$ which
is non-negative in the range $\alpha \ge L_i$. Finally, the last property follows directly from the second
property. ∎
The lemma states that for each of $f$ and $g$ there is exactly one $\alpha_i$ (possibly different for each function)
such that $f(\alpha_i) = 0$ and $g(\alpha_i) = 0$, but it does not provide an expression for computing $\alpha_i$ explicitly
such as in Lemma 1. However, it further tells us that for each function the value of $\alpha_i$ is in the
interval $[L_i, U_i]$. A value within $\epsilon$ of $\alpha_i$ can be found using binary search in
time proportional to $(U_i - L_i)\log(1/\epsilon)$.

We conclude this section by computing a possible value $U_i$ for each function and start with (33).
Note that $a_i = \max\{0, -2m_i/v_i\}$ satisfies $m_i + (a_i/2)v_i \ge 0$. Thus, $b_i = \max_p\{(2d\,\Sigma_{i,(p)}\,\phi\,x_{i,(p)}^2)/v_i\}$
satisfies $b_i v_i/(2d) - \Sigma_{i,(p)}\,\phi\,x_{i,(p)}^2/(1 + 2\alpha\phi\,\Sigma_{i,(p)}\,x_{i,(p)}^2) \ge 0$ for $p = 1 \ldots d$. Therefore setting $U_i =
\max\{a_i, b_i\}$ satisfies $f(U_i) \ge 0$ as desired. Finally, note that $U_i \ge L_i$ since $a_i \ge L_i$ by construction.
For (34) we use the same definition of $a_i$ but define $b_i = \max_p\{(2d\,\Upsilon_{i,(p)}^2\,\phi^2\,x_{i,(p)}^2)/v_i\}$ and $U_i =
\max\{a_i, b_i\}$. By a similar argument we have $g(U_i) \ge 0$ and $L_i \le U_i$.

To summarize, as opposed to the full covariance case, in the exact diagonal case we do not
compute the value of $\alpha_i$ explicitly, but use a binary-search algorithm to efficiently find a good
approximation for the optimal solution.
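The binary search for the linearized case (33) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the names `f_var` and `exact_diag_alpha`, the tolerance `eps` and the iteration cap are all hypothetical choices, and the bracket $[L_i, U_i]$ follows the construction given above for (33).

```python
def f_var(alpha, m, v, a_terms):
    """Equation (33) with a_p = phi * Sigma_{i,(p)} * x_{i,(p)}**2."""
    return m + alpha * v - sum(a / (1.0 + 2.0 * alpha * a) for a in a_terms)


def exact_diag_alpha(m, sigma_diag, x, phi, eps=1e-10, max_iter=200):
    """Approximate the root of f on [L_i, U_i] by bisection (v_i > 0 assumed)."""
    d = len(x)
    v = sum(s * xp * xp for s, xp in zip(sigma_diag, x))   # v_i = x^T Sigma_i x
    a_terms = [phi * s * xp * xp for s, xp in zip(sigma_diag, x)]
    lo = max(0.0, -m / v)                                   # L_i of Lemma 3
    if f_var(lo, m, v, a_terms) >= 0.0:
        # Only possible when L_i = 0: the constraint already holds, no update.
        return lo
    # U_i = max{a_i, b_i} as constructed in the text for (33).
    hi = max(max(0.0, -2.0 * m / v), max(2.0 * d * a / v for a in a_terms))
    for _ in range(max_iter):
        if hi - lo <= eps:
            break
        mid = (lo + hi) / 2.0
        if f_var(mid, m, v, a_terms) < 0.0:
            lo = mid    # root lies to the right of mid
        else:
            hi = mid    # root lies at or to the left of mid
    return (lo + hi) / 2.0


alpha = exact_diag_alpha(m=0.1, sigma_diag=[1.0, 1.0], x=[1.0, 1.0], phi=1.0)
print(alpha, f_var(alpha, 0.1, 2.0, [1.0, 1.0]))
```

The same loop applies to $g$ in (34) once the function and the bound $b_i$ are swapped for their change-of-variables counterparts. Bisection is valid here because Lemma 3 guarantees a sign change on the bracket and monotonicity inside it.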




7. Evaluation 


In this section we evaluate diagonalized versions of the CW algorithm on a range of binary classifi- 
cation problems for NLP tasks. We compare our methods against each other and against competitive 
online and batch learning algorithms. 

We selected a range of 5 tasks and created 17 binary classification problems. We begin with a 
description of each task. 


7.1 20 Newsgroups 


The 20 Newsgroups corpus contains approximately 20,000 newsgroup messages, partitioned across 
20 different newsgroups. The data set is a popular choice for binary and multi-class text classifi- 
cation as well as unsupervised clustering. Following common practice, we created binary problems 
from the data set by creating binary decision problems of choosing between two similar groups. 
Our groups are: 


• comp: comp.sys.ibm.pc.hardware vs. comp.sys.mac.hardware
• sci: sci.electronics vs. sci.med
• talk: talk.politics.guns vs. talk.politics.mideast


Each message was represented as a binary bag-of-words. For each problem we selected 1800 ex- 
amples balanced between the two labels. 


7.2 Reuters 


The Reuters Corpus Volume 1 (RCV1-v2/LYRL2004) contains over 800,000 manually categorized
newswire stories (Lewis et al., 2004). Each article contains one or more labels describing its gen- 
eral topic, industry and region. We created the following binary decision tasks from the labeled 
documents: 


• Insurance: Life (182002) vs. Non-Life (182003)
• Business Services: Banking (181000) vs. Financial (183000)
• Retail Distribution: Specialist Stores (165400) vs. Mixed Retail (165600).


These distinctions involve neighboring categories so they are fairly hard to make. Details on doc- 
ument preparation and feature extraction are given by Lewis et al. (2004). For each problem we 
selected 2000 examples using a bag-of-words representation with binary features. Each problem 
contains a balanced mixture of examples from each label. 


7.3 Sentiment 


We used a larger version of the sentiment multi-domain data set of Blitzer et al. (2007) used in
Dredze et al. (2010).⁹ This data consists of product reviews from 7 Amazon domains (apparel,
book, dvd, electronics, kitchen, music, video). The goal in each domain is to classify a product 





8. Corpus can be found at http://people.csail.mit.edu/jrennie/20Newsgroups/.
9. Data set can be found at http://www.cs.jhu.edu/~mdredze/datasets/sentiment/.
review as either positive or negative. Feature extraction creates unigram and bigram features using
counts following Blitzer et al. (2007). For the apparel domain we used all 1940 examples and for 
all other domains we used 2000 examples. Each problem contains a balanced mixture of example 
labels. 


7.4 Spam 


We include a spam classification problem as a sample problem from the space of email classification 
tasks. We chose spam since it is a widely studied problem with several publicly available data sets. 
We selected the 2006 ECML/PKDD Discovery Challenge spam data set (Bickel, 2006) and use the 
provided representations (bag-of-words). The goal is to classify an email (bag-of-words) as either 
spam or ham (not-spam). This corpus contains two data sets: task A, which has three users, and 
task B, which has 15 users. We use the three users from task A since it has more training examples. 
For each user we select 2000 examples. 


7.5 Pascal 


The PASCAL large scale learning challenge workshop provided several large scale binary data
sets.¹⁰ We selected the NLP task, which is a Webspam filtering problem. Each example is the text
from a web page. The task is to classify a webpage as either spam or ham. We used the default 
format provided by the workshop and selected 2000 examples. 


7.6 USPS 


The USPS data set contains examples of all 10 digits as part of a digit recognition task (OCR) (Hull,
1994). We created binary tasks by pairing each digit with another in order: 0/9, 1/2, 3/4, 5/6, 7/8.
We used the standard value of each pixel in the image, as well as the product of all the pixel pairs
in the image (bi-grams).

Each data set was randomly divided for 10-fold cross validation experiments. Classifier parameters
($\phi$) and the number of training iterations (up to 10) were tuned for each classification task on a
single randomized run over the data. Results are reported for each problem as the average accuracy
over the 10 folds. Statistical significance is computed using McNemar's test.
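For readers unfamiliar with McNemar's test, the sketch below shows one common variant, the chi-square approximation with continuity correction. The text does not say which variant was used, so this choice, the function name `mcnemar_p_value` and the input format (per-example correctness flags) are assumptions for illustration.

```python
import math


def mcnemar_p_value(correct_a, correct_b):
    """Two-sided p-value comparing two classifiers on the same test examples.

    correct_a, correct_b : sequences of booleans, True where the respective
    classifier labeled the example correctly.
    """
    # Only the examples on which the classifiers disagree carry information.
    b = sum(1 for ca, cb in zip(correct_a, correct_b) if ca and not cb)
    c = sum(1 for ca, cb in zip(correct_a, correct_b) if cb and not ca)
    if b + c == 0:
        return 1.0  # the classifiers never disagree
    # Continuity-corrected statistic, approximately chi-square with 1 d.o.f.
    stat = max(0.0, abs(b - c) - 1.0) ** 2 / (b + c)
    # Survival function of a chi-square variable with one degree of freedom.
    return math.erfc(math.sqrt(stat / 2.0))


# 10 examples only A gets right, 25 only B gets right, 65 both get right.
flags_a = [True] * 10 + [False] * 25 + [True] * 65
flags_b = [False] * 10 + [True] * 25 + [True] * 65
print(mcnemar_p_value(flags_a, flags_b))
```

The test depends only on the two disagreement counts, which is why it suits paired comparisons of classifiers evaluated on identical folds.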


7.7 Results 


We start by comparing the performance of the diagonalized CW algorithms: var (linearization)
against stdev (change of variables), approximate against exact diagonalization, and, for approximate
updates, KL against L2. All six algorithm combinations were run on the data sets described
above. The average test error on all data sets is shown in Table 2. For each method, we summarize
its overall performance by computing its mean rank among all the algorithms: a mean rank of 1
means that the algorithm achieved the lowest error on every data set, whereas a mean rank of 6
means that it had the highest error of the six algorithms on every data set.
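The mean-rank summary can be sketched as follows. The handling of ties (tied methods share the average of the ranks they span) is an assumption, since the text does not specify it, and the function name `mean_ranks` is hypothetical. The two example rows are the Apparel and Kitchen rows of Table 2.

```python
def mean_ranks(errors):
    """Mean rank of each method over data sets (rank 1 = lowest error).

    errors : list of rows, one per data set; row[j] is the error of method j.
    Tied methods share the average of the ranks they occupy (an assumption).
    """
    n_methods = len(errors[0])
    totals = [0.0] * n_methods
    for row in errors:
        order = sorted(range(n_methods), key=lambda j: row[j])
        pos = 0
        while pos < n_methods:
            end = pos
            # Extend the group while the next method has the same error.
            while end + 1 < n_methods and row[order[end + 1]] == row[order[pos]]:
                end += 1
            shared = (pos + end) / 2.0 + 1.0   # average of ranks pos+1 .. end+1
            for k in range(pos, end + 1):
                totals[order[k]] += shared
            pos = end + 1
    return [t / len(errors) for t in totals]


# Columns: var KL, var L2, var Exact, stdev KL, stdev L2, stdev Exact.
rows = [
    [12.53, 12.47, 14.79, 13.66, 13.51, 14.28],   # Apparel
    [13.75, 13.65, 15.30, 15.40, 15.40, 14.25],   # Kitchen
]
print(mean_ranks(rows))
```

Because only two of the data sets are used here, the output is not expected to match the Mean Rank row of Table 2.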

Starting with the KL methods for var and stdev, the stdev method does slightly better, a result
shown in Crammer et al. (2008). Comparing the two methods for diagonalization (KL vs. L2), while
L2 does slightly better for var (the best overall), the KL method appears to be more stable overall,





10. Data sets can be found at http://largescale.first.fraunhofer.de/workshop/.
Figure 3: Accuracy on test data after each iteration on six data sets.


achieving the best or closest to best results for the var and stdev methods. In comparison, the exact
methods do worse than the approximations. To understand these results, we examine some learning
curves from several of the online experiments in Figure 3 and in Figure 4, which show accuracy
on test data after each iteration. In many of these plots, the exact method does very well after the first
iteration, surpassing the performance of the approximate methods. However, after the first iteration
the exact method stops improving while the approximate methods continue to improve. By finding
a solution that exactly achieves the constraint, exact produces a more aggressive algorithm that
learns faster but overfits (Figure 5). In contrast, the approximate solutions do not fully enforce the
constraint on each update, but this slower learning reduces overfitting and improves generalization
error over several iterations.























































[Figure panels (plots not recoverable from the extracted text): Electronics, Kitchen, Music, Videos. Each panel plots Accuracy against Iteration for Perceptron, PA, Stdev, Stdev-Exact, Variance and Variance-Exact.]





















































We next compare the results from our approximate-diagonalization CW methods to other popular
online learning algorithms (Table 3). We evaluated the perceptron (Rosenblatt, 1958), passive-aggressive
(Crammer et al., 2006a), stochastic gradient descent (Zhang, 2004; Blitzer et al., 2007)
and a diagonalized second order perceptron (Cesa-Bianchi et al., 2005), all of which perform well
for NLP problems. In every experiment, a CW method improved over all of the online learning
baselines.

As discussed above, online algorithms are attractive even for batch learning because of their 
simplicity and ability to operate on extremely large data sets. In the batch setting, these algorithms 
are run several times over the training data, which yields slower performance than single pass learn- 
ing (Carvalho and Cohen, 2006). While we have shown that CW improves on accuracy, it also learns 
faster than other baselines, requiring fewer iterations over the training data. Such behavior can be 
Figure 4: Accuracy on test data after each iteration on the six Amazon data sets.

                         |          var          |         stdev
Task                     |  KL   |  L2   | Exact |  KL   |  L2   | Exact
Sentiment  Apparel       | 12.53 | 12.47 | 14.79 | 13.66 | 13.51 | 14.28
           Books         | 16.90 | 17.30 | 19.60 | 16.25 | 22.35 | 15.25
           DVD           | 17.45 | 17.05 | 19.05 | 17.60 | 18.30 | 16.95
           Electronics   | 14.95 | 15.40 | 16.65 | 14.75 | 20.95 | 15.50
           Kitchen       | 13.75 | 13.65 | 15.30 | 15.40 | 15.40 | 14.25
           Music         | 17.15 | 17.55 | 19.90 | 17.75 | 17.85 | 19.35
           Video         | 21.75 | 18.55 | 25.85 | 22.50 | 19.00 | 23.60
ECML       Spam A        |  2.65 |  1.45 |  3.10 |  0.75 |  4.15 |  0.80
           Spam B        |  1.35 |  1.20 |  2.65 |  1.00 |  1.10 |  1.05
           Spam C        |  1.50 |  1.40 |  3.40 |  1.50 |  3.55 |  1.35
Reuters    Retail        | 10.55 | 18.80 | 18.75 | 10.25 | 11.05 | 11.05
           Business      | 16.35 | 15.35 | 17.10 | 16.45 | 16.80 | 17.20
           Insurance     |  8.20 |  9.15 | 10.20 |  8.55 |  9.55 | 10.10
20 News    Comp          |  6.69 |  5.61 |  8.59 |  6.79 | 16.64 |  6.90
           Sci           |  2.44 |  2.74 |  3.20 |  3.04 | 13.35 |  3.10
           Talk          |  0.86 |  0.43 |  2.43 |  0.27 |  8.38 |  1.14
Pascal     Webspam       |  3.55 |  3.10 |  3.85 |  2.95 |  3.10 |  5.35
Mean Rank                |  2.53 |  2.35 |  5.29 |  2.47 |  4.53 |  3.59





































Table 2: Average Error of all variants of confidence-weighted algorithms presented in this paper 
over 17 binary text classification tasks. The best score for each data set is set in bold. The 
mean rank is the average rank of each algorithm across data sets, ranging from 1 (best) to 
6. 


seen in Figure 3 and Figure 4, which show test accuracy after each training iteration for CW and
PA. CW not only improves over PA but also converges very quickly, reaching near best performance
on the first iteration. In contrast, PA benefits from multiple iterations over the data; its performance
changes significantly from the first to fifth iteration. The plots also illustrate exact's behavior, which
initially beats PA but does not improve. In fact, on eleven of the twelve data sets, var-Exact beats
PA on the first iteration.


7.8 Batch Learning 


While online algorithms are widely used, batch algorithms are still preferred for many tasks. Batch 
algorithms can make global learning decisions by examining the entire data set, an ability beyond 
online algorithms. In general, when batch algorithms can be applied they perform better. We 
compare CW to three standard batch algorithms: naïve Bayes (default configuration in MALLET 
McCallum, 2002), maximum entropy classification (default configuration in MALLET McCallum, 
2002) and support vector machines (LibSVM Chang and Lin, 2001). Classifier parameters (Gaus- 
sian prior for maxent and C for SVM) were tuned as for the online methods. 

Results for batch learning are shown in Table 4. As expected, the batch methods tend to do
better than the online methods (perceptron, PA, and SGD). However, in 13 out of 17 tasks the CW 
                         |       var       |      stdev      |
Task                     |   KL   |   L2   |   KL   |   L2   |  Per  |  PA   |  SOP  |  SGD
Sentiment  Apparel       |  12.53 |  12.47 |  13.66 |  13.51 | 17.84 | 13.35 | 17.42 | 13.76
           Books         |  16.90 |  17.30 | ⋆16.25 |  22.35 | 23.10 | 18.45 | 19.75 | 18.60
           DVD           |  17.45 | ⋆17.05 | ⋆17.60 |  18.30 | 20.45 | 20.95 | 20.55 | 18.70
           Electronics   |  14.95 |  15.40 | ⋆14.75 | ⋆20.95 | 18.65 | 17.45 | 20.20 | 16.00
           Kitchen       | ∗13.75 | ∗13.65 |  15.40 |  15.40 | 16.65 | 15.20 | 21.20 | 16.00
           Music         |  17.15 |  17.55 |  17.75 |  17.85 | 22.35 | 19.40 | 21.80 | 18.20
           Video         |  21.75 |  18.55 |  22.50 |  19.00 | 21.50 | 18.90 | 21.90 | 19.25
ECML       Spam A        |   2.65 |   1.45 |   0.75 |   4.15 |  3.20 |  1.40 |  4.20 |  2.30
           Spam B        |   1.35 |  ⋆1.20 |   1.00 |   1.10 |  3.00 |  2.40 |  1.95 |  2.80
           Spam C        |   1.50 |  ∗1.40 |  ⋆1.50 |   3.55 |  3.65 |  2.10 |  2.90 |  2.15
Reuters    Retail        | †10.55 |  18.80 | †10.25 | †11.05 | 19.60 | 17.50 | 19.45 | 14.30
           Business      |  16.35 |  15.35 |  16.45 |  16.80 | 19.00 | 16.15 | 21.80 | 15.45
           Insurance     |   8.20 |   9.15 |   8.55 |   9.55 | 11.35 | 12.35 | 10.15 |  9.35
20 News    Comp          |   6.69 |   5.61 |   6.79 | †16.64 | 10.30 |  8.65 | 10.45 |  7.88
           Sci           |  †2.44 |  †2.74 |   3.04 | †13.35 |  6.70 |  8.06 |  4.67 |  4.06
           Talk          |   0.86 |   0.43 |  †0.27 |  †8.38 |  3.24 |  1.57 |  2.59 |  1.19
Pascal     Webspam       |   3.55 |   3.10 |  ⋆2.95 |   3.10 |  7.60 |  3.90 |  5.05 |  3.50
USPS       0 vs 9        |   0.56 |   0.56 |   0.56 |   0.56 |  0.37 |  0.93 |  0.75 |  0.56
           1 vs 2        |   1.73 |   0.87 |   0.87 |   4.33 |  1.73 |  0.65 |  2.38 | 42.86
           3 vs 4        |   1.37 |   1.09 |   1.09 |   1.09 |  0.82 |  1.09 |  2.73 | 45.36
           5 vs 6        |   0.91 |   0.91 |   1.52 |   1.52 |  4.24 |  0.91 |  3.03 | 48.48
           7 vs 8        |   2.24 |   2.24 |   1.92 |   2.24 |  2.56 |  1.60 |  4.15 | 53.04



































Table 3: Average error of approximate-diagonal confidence-weighted algorithms and four other
online algorithms: the perceptron algorithm (Per), the passive-aggressive (PA) algorithm,
the second order perceptron (SOP) and stochastic gradient descent (SGD), evaluated using 17
binary text classification tasks. The best score for each data set is set in bold. Statistical
significance measured by McNemar's test indicates when a CW algorithm is significantly
different (⋆ p = 0.05, ∗ p = 0.01, † p = 0.001) from each of the four baselines (perceptron,
PA, SOP, SGD).


algorithm beats all of the batch methods. The much faster and simpler online algorithm performs 
better than the slower more complex batch methods. 

The speed advantage of online methods in the batch setting can be seen in Table 5, which shows 
the average training time in seconds for a single experiment (fold) for a representative selection of 
CW algorithms and some of the baselines. The online times include the multiple iterations selected 
for each online learning experiment. The differences between the online and batch algorithms are 
striking. While CW performs better than the batch methods, it is also much faster, while being 
equivalent in speed to the other online methods. For webspam data, which contains many features, 
an SVM takes over 1.5 minutes to train while the CW algorithms take between 1-2 seconds. 

We also evaluated the effects of commonly used techniques for online and batch learning, in- 
cluding averaging and TFIDF features; they did not improve results so details are omitted. Although 
                         |       var       |      stdev      |
Task                     |   KL   |   L2   |   KL   |   L2   |  NB   | Maxent |  SVM
Sentiment  Apparel       |  12.53 |  12.47 |  13.66 |  13.51 | 12.63 |  13.56 | 13.92
           Books         |  16.90 |  17.30 | ⋆16.25 |  22.35 | 18.40 |  18.05 | 18.25
           DVD           |  17.45 |  17.05 |  17.60 |  18.30 | 21.00 |  18.00 | 19.60
           Electronics   |  14.95 |  15.40 | ⋆14.75 |  20.95 | 17.05 |  15.85 | 16.25
           Kitchen       | ∗13.75 |  13.65 |  15.40 |  15.40 | 15.00 |  15.25 | 15.50
           Music         |  17.15 |  17.55 |  17.75 |  17.85 | 18.65 |  17.90 | 18.25
           Video         |  21.75 |  18.55 |  22.50 |  19.00 | 22.95 |  18.40 | 18.80
ECML       Spam A        |   2.65 |   1.45 |   0.75 |   4.15 |  3.70 |   1.30 |  1.75
           Spam B        |   1.35 |   1.20 |  ⋆1.00 |   1.10 |  4.20 |   1.55 |  1.90
           Spam C        |   1.50 |   1.40 |   1.50 |  †3.55 |  1.40 |   1.35 |  1.40
Reuters    Retail        | ∗10.55 | ⋆18.80 | †10.25 | ∗11.05 | 16.55 |  12.55 | 12.90
           Business      |  16.35 |  15.35 |  16.45 |  16.80 | 20.00 |  15.85 | 15.60
           Insurance     |   8.20 |   9.15 |   8.55 |   9.55 | 11.80 |   9.10 |  9.75
20 News    Comp          |   6.69 |   5.61 |   6.79 | †16.64 |  5.56 |   7.82 |  7.67
           Sci           |  ∗2.44 |  ⋆2.74 |   3.04 | †13.35 |  1.42 |   3.40 |  3.86
           Talk          |   0.86 |  ∗0.43 |  †0.27 |  †8.38 |  0.97 |   1.03 |  1.24
Pascal     Webspam       |   3.55 |  ∗3.10 |  †2.95 |  ⋆3.10 | 19.10 |   6.05 |  3.85
USPS       0 vs 9        |   0.56 |   0.56 |   0.56 |   0.56 |  1.12 |  33.02 |  0.56
           1 vs 2        |   1.73 |   0.87 |   0.87 |   4.33 |  1.52 |  42.86 |  0.65
           3 vs 4        |   1.37 |   1.09 |   1.09 |   1.09 |  1.91 |  45.36 |  0.55
           5 vs 6        |   0.91 |   0.91 |   1.52 |   1.52 |  3.03 |  48.48 |  0.61
           7 vs 8        |   2.24 |   2.24 |   1.92 |   2.24 |  2.56 |  53.04 |  0.96
































Table 4: Average error of approximate-diagonal confidence-weighted algorithms and three batch
algorithms: naive Bayes (NB), maximum entropy classifier (Maxent) and support vector
machine (SVM), evaluated using 17 binary text classification tasks. The best score for each
data set is set in bold. Statistical significance measured by McNemar's test indicates when
a CW algorithm is significantly different (⋆ p = 0.05, ∗ p = 0.01, † p = 0.001) from each
of the three baselines (NB, Maxent, SVM).


the above data sets are balanced with respect to labels, we also evaluated the methods on variant 
data sets with unbalanced label distributions, and still saw similar benefits from the CW methods. 


7.9 Large Data Sets 


Online algorithms are especially attractive in tasks where training data exceeds available main mem- 
ory or in streaming settings where training examples cannot be saved. In both of these settings, a 
single sequential pass over the data is highly preferred to multiple passes common in batch training 
cases. So far, we have shown that CW algorithms are more aggressive than other online algorithms, 
an advantage when the algorithm is limited to a single pass. The result is both higher performance
and fewer training iterations. The question we now answer is whether this advantage is maintained 
in large data settings. 
Task        | variance KL | variance Exact | Perceptron |  PA  | Maxent | SVM
Apparel     |    0.2      |      0.2       |    0.1     | 0.04 |   1    |  11
Books       |    0.1      |      0.8       |    0.1     | 0.1  |   6    |  25
DVD         |    0.1      |      0.9       |    0.1     | 0.04 |   6    |  22
Electronics |    0.1      |      0.4       |    0.1     | 0.03 |   2    |  17
Kitchen     |    0.1      |      0.6       |    0.1     | 0.1  |   2    |  14
Music       |    0.1      |      0.5       |    0.1     | 0.1  |   4    |  19
Video       |    0.3      |      0.7       |    0.1     | 0.2  |   6    |  24
Spam A      |    0.03     |      0.2       |    0.1     | 0.8  |   3    |   3
Spam B      |    0.04     |      0.2       |    0.1     | 0.1  |   3    |   3
Spam C      |    0.1      |      0.2       |    0.04    | 0.04 |   1    |   2
Retail      |    0.1      |      0.1       |    0.03    | 0.03 |   0.6  |   4
Business    |    0.1      |      0.3       |    0.1     | 0.02 |   0.3  |   5
Insurance   |    0.1      |      0.2       |    0.1     | 0.02 |   0.8  |   4
Comp        |    0.04     |      0.2       |    0.1     | 0.2  |   2    |  11
Sci         |    0.1      |      0.3       |    0.1     | 0.04 |   2    |   7
Talk        |    0.1      |      0.3       |    0.1     | 0.1  |   4    |   7
Webspam     |    1        |      2         |    3       | 1    |  12    | 103





























Table 5: Training times in seconds for a single training run (averaged over 10 trials.) 


We selected two large data sets for evaluation. The combined product reviews for all the domains 
by Blitzer et al. (2007) yield one million sentiment examples. While most reviews were from the 
book domain, the reviews are taken from a wide range of Amazon product types and are mostly 
positive. From the Reuters corpus, we created a one vs. all classification task for the Corporate 
topic label, yielding 804,411 examples of which 381,325 are labeled corporate. For the two data 
sets, we created four random splits each with 10,000 test examples and the remaining examples 
saved for training. Parameters were optimized by training on 5,000 randomly chosen examples. We 
evaluated the CW var-KL algorithm and the passive-aggressive algorithm using a single pass over 
this data. 

The results are shown as horizontal lines in Figure 6. For the Sentiment data, CW maintains 
over a 1% lead when compared to PA. On the Reuters data, the results are reversed with PA having 
the advantage. The difference between these behaviors may be related to the different feature rep- 
resentations used by each data set. The Reuters data contains 288,062 unique features, for a feature 
to document ratio of 0.36. In contrast, the sentiment data contains 13,460,254 unique features, a 
feature to document ratio of 13.33. This means that Reuters features will occur several times during 
training while many sentiment features only once. This may give CW an advantage on Sentiment. 
It is also possible that CW over-fits the Reuters data, something that will be observed in the next set 
of experiments below. 


7.10 Distributed Training 


While faster learning over a data stream is important, not all large data sets can be processed by 
a single processor. Therefore, we looked at the case where many processors are available, each 
with easy access to a fraction of the training data, but where communication between processors 
Figure 5: The trained model's accuracy on the training data for fifteen of the data sets for the exact
diagonal and L2 diagonal approximation Stdev methods. Points above the line indicate
that the exact algorithm obtained a higher training accuracy than the L2 diagonal method.
Observe that the exact method almost always obtains a higher training accuracy, and 
is nearly 100% in every case. Coupled with the results on test data, which are worse 
for the exact methods, these results indicate that the exact method overfits the training 
data. These results are typical when comparing the exact algorithms against the diagonal 
approximations. 


is limited. In this setting, we would like an algorithm where individual processors train models on 
their easily accessible data, and then they combine their models. While this often does not perform 
as well as a single model trained on all of the data, it is a cost-effective way of learning from very 
large training sets. 

One simple approach is to combine many trained models by averaging their weights (McDonald 
et al., 2010). However, averaging models trained in parallel assumes that each model has an equally 
accurate estimate of the model weights. This is obviously not the case where different processors 
saw different portions of the data, made different updates, or saw features that other processors did 
not. Rather than taking an average over all models, CW provides a confidence value for each weight, 
allowing for a more intelligent combination of weights from multiple models. 

Since each model is a Gaussian distribution over weights, combining multiple trained CW clas- 
sifiers is equivalent to combining multiple Gaussian distributions. Specifically, we compute the 
combined model by finding the Gaussian that minimizes the total divergence to the set C of Gaus- 
sian distributions (individually trained classifiers) for some divergence operator D: 


$$\min_{\mu,\Sigma} \sum_{c \in C} D\big((\mu,\Sigma)\,\|\,(\mu_c,\Sigma_c)\big).$$


If D is the Euclidean distance, then this is just the average of the individual models. However, we 
can instead rely on the variance estimates of each Gaussian by choosing the KL divergence for D. 
Figure 6: Results for Reuters (800k) and Sentiment (1000k) averaged over 4 runs. Horizontal lines 
show the test accuracy of a model trained on the entire training set. Vertical bars show 
the performance of n (10, 50, 100) classifiers trained on disjoint sections of the data as 
the average performance, uniform combination, or weighted combination. 


This minimization leads to the following weighted combination of individual model means: 


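$$\mu = \Big(\sum_{c \in C} \Sigma_c^{-1}\Big)^{-1} \sum_{c \in C} \Sigma_c^{-1}\mu_c, \qquad \Sigma^{-1} = \sum_{c \in C} \Sigma_c^{-1}.$$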
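
As a brief check of why the KL choice yields a confidence-weighted mean, the following sketch differentiates the objective with respect to the combined mean only. It assumes the standard closed form of the KL divergence between two d-dimensional Gaussians; the symbol d and the intermediate steps are not taken from the text above.

```latex
% Assumption: standard closed-form KL divergence between d-dimensional Gaussians.
\begin{align*}
D_{\mathrm{KL}}\big(\mathcal{N}(\mu,\Sigma)\,\|\,\mathcal{N}(\mu_c,\Sigma_c)\big)
  &= \tfrac{1}{2}\Big[\operatorname{tr}\big(\Sigma_c^{-1}\Sigma\big)
     + (\mu-\mu_c)^{\top}\Sigma_c^{-1}(\mu-\mu_c) - d
     + \ln\frac{\det\Sigma_c}{\det\Sigma}\Big] \\
\nabla_{\mu}\sum_{c\in C} D_{\mathrm{KL}}\big(\mathcal{N}(\mu,\Sigma)\,\|\,\mathcal{N}(\mu_c,\Sigma_c)\big)
  &= \sum_{c\in C}\Sigma_c^{-1}(\mu-\mu_c) = 0 \\
\Longrightarrow\quad
\mu &= \Big(\sum_{c\in C}\Sigma_c^{-1}\Big)^{-1}\sum_{c\in C}\Sigma_c^{-1}\mu_c
\end{align*}
```

When every covariance is diagonal, this reduces to a per-weight average in which each model's mean is weighted by the inverse of its variance for that weight, so a model that is more confident about a feature contributes more to the combined value.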

We evaluate classifier combination by dividing the example stream into n (10, 50, 100) disjoint 
parts and training one model on each part. On the test data, across 4 randomized runs, we report 
the average performance of the n individual classifiers (average), the classifier obtained by 
averaging the n sets of weights (L2), and the classifier obtained by the KL divergence combination. 
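
The two combination rules compared here can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: it assumes diagonal covariances (so each model is a list of per-weight means and variances), and the names `combine_uniform`, `combine_kl`, and `Model` are hypothetical.

```python
from typing import List, Sequence, Tuple

# Hypothetical representation: (means, variances) of a diagonal Gaussian.
Model = Tuple[List[float], List[float]]


def combine_uniform(models: Sequence[Model]) -> List[float]:
    """Average the mean vectors, ignoring confidence (the L2 combination)."""
    n = len(models)
    dim = len(models[0][0])
    return [sum(mu[j] for mu, _ in models) / n for j in range(dim)]


def combine_kl(models: Sequence[Model]) -> List[float]:
    """Precision-weighted mean: weight j of each model counts as 1 / variance_j."""
    dim = len(models[0][0])
    combined = []
    for j in range(dim):
        precision = sum(1.0 / var[j] for _, var in models)
        weighted = sum(mu[j] / var[j] for mu, var in models)
        combined.append(weighted / precision)
    return combined


if __name__ == "__main__":
    # Model A saw feature 0 often (low variance); model B never saw it.
    model_a: Model = ([2.0, -1.0], [0.1, 0.5])
    model_b: Model = ([0.0, 1.0], [1.0, 0.5])
    print("uniform:", combine_uniform([model_a, model_b]))
    print("KL:     ", combine_kl([model_a, model_b]))
```

In the toy example, the uniform average halves the weight that model A learned for feature 0, while the KL combination stays close to A's value because B's variance for that feature is still large. Where both models are equally confident, the two rules agree.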

Average accuracy on the test sets is reported in Figure 6. As stated above, the PA single 
model achieves higher accuracy for Reuters, possibly because of the low feature-to-document ratio. 
However, combining 10 CW classifiers achieves the best performance. For sentiment, combining 
10 classifiers beats PA but is not as good as a single CW model. In every case, combining the 
classifiers improves over each model individually. On sentiment, the KL combination improves 
over the L2 combination, and on Reuters the two combinations perform equivalently. For comparison, 
we show the accuracy on the test data for a single run of the CW Variance KL model on sentiment 
data in Figure 7. When trained on all of the data and distributed across 10 machines, the classifier 
loses 1% of its performance, which, using Figure 7 as a guide, corresponds to using 22% of the 
training data. 

Finally, we computed the actual run time of both PA and CW on the large data sets to compare 
the speed of each model. While CW is more complex, requiring more computation per example, 
the actual speed is comparable to PA; in all tests the run time of the two algorithms was indistin- 
guishable. 


8. Related Work 


The idea of using weight-specific variable learning rates has a long history in neural-network learn- 
ing (Sutton, 1992), although we do not know of a previous model that specifically models confidence 
in a way that takes into account the frequency of features. 
Figure 7: Results from CW Variance KL run on the large scale Sentiment data (1000k) averaged 
over 4 runs. Accuracy on test data is measured every 10k training examples to demon- 
strate the improvement with increases in training data. 


Online additive algorithms have a long history, from the perceptron (Rosenblatt, 1958) to more 
recent methods (Kivinen and Warmuth, 1997; Crammer et al., 2006b). Our update has a more 
general form, in which the input vector x_i is linearly transformed using the covariance matrix, both 
rotating the input and assigning weight-specific learning rates. 

The second order perceptron (SOP) (Cesa-Bianchi et al., 2005) demonstrated that second-order 
techniques can improve first-order online methods. Both SOP and CW maintain second-order in- 
formation. SOP is mistake driven while CW is passive-aggressive. SOP uses the current example in 
the correlation matrix for prediction while CW updates after prediction. A variant of stdev similar 
to SOP follows from our derivation if we fix the Lagrange multiplier in (20) to a predefined value 
α_i = α, omit the square root, and use a gradient-descent optimization step. Fundamentally, CW 
algorithms have a probabilistic motivation, while the SOP is geometric: replace the ball around an 
example with a refined ellipsoid. Shivaswamy and Jebara (2007) used a similar motivation in batch 
learning. 

Ensemble learning shares the idea of combining multiple classifiers. Gaussian process classifi- 
cation (GPC) maintains a Gaussian distribution over weight vectors (primal) or over regressor val- 
ues (dual). Our algorithm uses a different update criterion than the standard GPC Bayesian updates 
(Rasmussen and Williams, 2006, Chapter 3), avoiding the challenge of approximating posteriors. 
Bayes point machines (Herbrich et al., 2001) maintain a collection of weight vectors consistent with 
the training data, and use the single linear classifier which best represents the collection. Concep- 
tually, the collection is a non-parametric distribution over the weight vectors. Its online version 
(Harrington et al., 2003) maintains a set of weight vectors that are updated simultaneously. The rel- 
evance vector machine (Tipping, 2001) incorporates probability into the dual formulation of SVMs. 
As in our work, the dual parameters are random variables distributed according to a diagonal Gaus- 
sian with example specific variance. The weighted-majority (Littlestone and Warmuth, 1994) algo- 
rithm and later improvements (Cesa-Bianchi et al., 1997) combine the output of multiple arbitrary 
classifiers, maintaining a multinomial distribution over the experts. We assume linear classifiers as 
experts and maintain a Gaussian distribution over their weight vectors. 

With the growth of available data there is an increasing need for algorithms that process train- 
ing data very efficiently. A similar approach to ours is to train classifiers incrementally (Bordes 
and Bottou, 2005). The extreme case is to use each example once, without repetitions, as in the 
multiplicative update method of Carvalho and Cohen (2006). 

In Bayesian modeling, we note a few approaches that use parameterized distributions over weight 
vectors. Borrowing concepts from support vector machines, Jaakkola et al. (1999) developed maxi- 
mum entropy discrimination, which models the generation of examples with one generative model 
for each class. The model consisted of distributions over the weights and over margin thresholds. 
They used Bayesian prediction and set the weights using the maximum-entropy principle. In a 
more recent approach, Minka et al. (2009) proposed using additional virtual vectors to allow more 
expressive power beyond Gaussian prior and posterior. 

Passing the output of a linear model through a logistic function has a long history in the statistical 
literature, and is extensively covered in many textbooks (e.g., Hastie et al., 2001). Platt (1998) 
used similar ideas to convert the output of a support vector machine into probabilistic quantities. 

Since the conference versions of this work were published, a few algorithms reminiscent of CW 
were proposed. Duchi et al. (2010) and McMahan and Streeter (2010) proposed to replace the standard 
Euclidean distance in stochastic gradient descent with a general Mahalanobis distance defined 
by the second-order information, captured by the instantaneous second-order moment. Crammer 
et al. (2009a) proposed to replace the hard constraint enforced by the CW algorithm with a relaxed 
version, formulated using an additional term in the objective function. They call their algorithm 
AROW, for adaptive regularization of weight vectors. Orabona and Crammer (2010) later proposed 
a framework for online learning, which contains an algorithm close to AROW as a special case, as 
well as other new algorithms. From a different perspective, Crammer and Lee (2010) proposed a 
microscopic view of learning, which tracks individual weight vectors as opposed to only their 
macroscopic quantities, such as mean and covariance. Their algorithm has an update form similar 
to that of CW ((11) and (13)), yet with different rates. 

Finally, Shivaswamy and Jebara (2010b,a) proposed to use second-order information, that is, the 
variance, in the batch setting, where an i.i.d. distribution over the examples is assumed. Their algorithm 
both maximizes the (average) margin and at the same time minimizes its variance. Note that they 
do not maintain a distribution over weight vectors, and the probability space is induced using the 
distribution over training examples. 


9. Conclusion 


We have presented confidence-weighted linear classifiers, a new learning method designed for NLP 
problems based on the notion of weight confidence. The algorithm maintains a distribution over 
weight vectors; online updates both improve the weight estimates and reduce the distribution’s 
variance. Our method improves over both online and batch methods and learns faster on over a 
dozen NLP data sets. Additionally, our new algorithms allow more intelligent classifier combination 
techniques, yielding improved performance in distributed learning. 